TECHNOLOGY REVIEW:

SPEECH  RECOGNITION  FOR  LANGUAGE  SUSTAINMENT 


Summary 


The  Technology  Review  for  Speech  Recognition  for  Language  Sustainment  was  an 
effort  of  the  Special  Operations  Research,  Development  and  Acquisition  Center  (SORDAC), 
the  U.S.  Army  Research  Institute  (ARI),  and  the  Advanced  Research  Projects  Agency 
(ARPA) in cooperation with the U.S. Army Special Operations Command (USASOC)
Language  Office.  The  purpose  of  the  workshop  was  to  review  the  state-of-the-art  in 
continuous  speech  recognition  as  it  applies  to  foreign  language  training,  sustainment,  and 
enhancement. Applications to Special Operations Forces (SOF) were the focus of
presentations  and  discussions.  The  workshop  was  held  on  August  2  and  3,  1995,  in 
Fayetteville,  NC  (Appendix  A  contains  the  agenda). 

The  review  addressed  short-term,  intermediate,  and  long-term  goals  for  applying 
technology  to  SOF  language  training/sustainment  needs.  It  looked  at  what  is  available  now  or 
can  be  produced  in  the  short  term  (1  year)  with  available  technology;  what  can  be  done  to 
meet  SOF’s  needs  in  the  mid-term  by  developing  and  exploiting  advanced  technologies  (2  to 
3  years  out);  and  what  to  plan  for  from  emerging  technologies  in  longer-term  research  and 
development  (5  to  20  years  out).  Presenters  included  major  developers  of  continuous  speech 
recognition  systems  with  demonstrated  interest  in  language  education,  ranging  from  industry 
to  academia.  They  showed  a  variety  of  multilingual  systems,  some  directly  addressing 
language  training  and  others  readily  adaptable  to  training  and  sustainment  (Appendix  B).  In 
addition,  participants  discussed  speech  translation  technology  (Appendix  C)  and  its  links  to 
language  training  technologies.  While  the  focus  of  the  review  was  SOF,  representatives  of 
other  military  and  government  user  groups  also  attended  (Appendix  E  lists  the  participants). 

First  Day  Focus:  Training  and  Sustainment 

The  first  major  presentations  of  the  day  were  by  representatives  of  the  Special 
Operations  Forces  (SOF)  at  Ft.  Bragg.  LTC  Victor  Kjoss,  Chief  of  Training  Division, 
DCSOPS,  USASOC,  overviewed  the  structure  and  missions  of  SOF  and  the  role  of  foreign 
language skill in performing those missions. LTC H. Eugene Williams, 3rd Battalion, 1st
Special  Warfare  Training  Group,  JFK  Special  Warfare  Center  and  School,  presented  the 
school  perspective  on  issues  in  initial  language  training.  LTC  Robert  Brady,  G-3  Special 
Forces  Command,  spoke  on  issues  in  language  sustainment  and  enhancement  from  the 
perspective  of  the  SOF  Groups. 

To begin the technology review, Dr. Cliff Weinstein of MIT Lincoln Laboratory
overviewed  applications  of  speech  recognition  technology  (voice-based  speaker  identification, 
language  identification,  command  and  control,  large  vocabulary  dictation,  etc.)  and  described 
rapid  growth  over  the  past  decade  in  the  rates  of  recognition  accuracy  and  the  size  of 
recognition  vocabularies.  For  example,  recognition  of  read  speech,  spoken  continuously 
without pauses (known as continuous speech recognition) has progressed from vocabularies of
5K words to vocabularies of 60K words, with accuracy rates in the mid-90-percent range in
the highest-performing recognizers.*
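
Accuracy figures of this kind are conventionally computed at the word level by aligning the recognizer's output against a reference transcript. The sketch below is illustrative only: the function name and the sample sentences are assumptions, and the review does not state which scoring procedure was used.

```python
def word_accuracy(reference: str, hypothesis: str) -> float:
    """Return 1 - (word-level edit distance / number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # prev[j] is the edit distance between the reference prefix seen so far
    # and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 1.0 - prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical reference and recognizer output, one substituted word.
    print(word_accuracy("show me flights to boston", "show me flight to boston"))
```

Because insertions count as errors, this measure can fall below zero when the hypothesis is much longer than the reference.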

Nine  system  developers  or  groups  then  reviewed  and  demonstrated 
specific  applications  of  speech  recognition  (Appendix  B  presents  descriptions). 

Dr.  Martin  Rothenberg,  Syracuse  Language  Systems,  Inc.  (p.  B-35) 

Dr.  William  G.  Harless,  Interactive  Drama,  Inc.  (p.  B-36) 

LTC  Steve  LaRocca  and  COL  Woody  Held,  U.S.  Military  Academy  (USMA), 
West Point (p. B-37)

Dr.  Madeleine  Bates  and  Mr.  Sean  Colbath,  BBN  Systems  and  Technologies 
(p.  B-38) 

Dr. Victor W. Zue, Dr. Joseph Polifroni, and Dr. Stephanie Seneff,
Massachusetts Institute of Technology (MIT) (p. B-39)

Dr.  Marikka  Rypa,  Dr.  Patti  Price,  Dr.  Leo  Neumeyer,  and  Dr.  George 
Chen,  SRI;  with  Dr.  Kathleen  Egan,  Ms.  Helena  Hughes,  Dr.  Mike  Valatka, 
and  Ms.  Jacqueline  Pogany,  CIA  Foreign  Language  Training  Laboratory 
(p.  B-46) 

Dr.  Jack  Mostow  and  Dr.  Maxine  Eskenazi,  Carnegie  Mellon  University 
(CMU)  Robotics  Institute  (p.  B-48) 

Dr.  Jared  Bernstein,  Entropic  Research  Laboratory,  Inc.  (p.  B-49) 

Dr.  John  T.  Lynch  and  Dr.  Beth  Carlson,  MIT  Lincoln  Laboratory  (p.  B-50) 

The  technologies  applied  ranged  from  lower-end  systems  using  commercial 
off-the-shelf  (COTS)  recognizers  that  process  discrete  speech  (single,  fixed  words  and 
phrases)  to  higher-end  systems  using  prototype  recognizers  that  handle  continuous  speech 
(variable utterances, spoken naturally without pauses between words). The applications
themselves  varied  from  language  tutoring  to  dictation  to  speech-activated  database  query. 

The  review  included  systems  for  purposes  other  than  tutoring,  as  well  as  systems 
implemented in English rather than foreign languages, so as to demonstrate fully the power of
speech recognition technology and to suggest the range of ways it might be deployed for
foreign language sustainment. Languages in which recognizers were implemented included
English, Spanish, French, German, Italian, Japanese, Chinese, and Korean.


* Briefing charts and papers are presented in the appendices. References in parentheses cite the appendix page
where the material appears. Dr. Weinstein’s briefing charts start on page B-1.
Discrete speech recognition engines have been available as COTS items for some time
and  can  be  purchased  together  with  development  kits  that  let  system  builders  make  their  own 
speech-interactive  applications.  For  example,  the  recognizer  from  Dragon  Systems  underlies 
two  of  the  systems  demonstrated:  the  commercial  product  TriplePlay  Plus!  from  Syracuse 
Language  Systems,  which  teaches  core  vocabulary  in  selected  European  languages,  and  the 
prototype  instructional  packages  from  Interactive  Drama,  which  combine  speech  recognition 
with  interactive  video.  The  "talkie"  language  lessons  designed  by  the  USMA  use  the 
commercially  available  Aria  Listener  software  to  support  vocabulary  building  as  well  as 
pronunciation  training  on  foreign  word  pairs  that  are  confusing  to  learners. 

Continuous  speech  recognition  (CSR)  engines  have  been  used  largely  in  research 
prototypes.  Several  of  the  systems  included  in  the  review  showed  the  power  of  CSR 
technology  for  authentic  tasks  in  which  users  speak  at  natural  rates,  without  pauses  between 
words,  with  some  freedom  of  expression,  and  without  having  to  train  the  recognizer  on  their 
particular  voice.  Tasks  included  Wall  Street  Journal  dictation  (BBN),  map  navigation  (MIT), 
and  air  travel  information  queries  (MIT,  BBN).  For  example,  MIT’s  Voyager  allows  users  to 
ask  in  Japanese  the  location  of  various  sites  within  an  American  city.  The  system  answers  by 
highlighting  the  sites  on  a  map  of  the  city  as  well  as  by  voicing  a  description  of  the  location, 
in  the  user’s  choice  of  Japanese  or  English.  Queries  are  unconstrained  -  that  is,  users  are  not 
told  in  advance  what  to  say  or  how  to  say  it.  Moreover,  the  system’s  estimation  of  what  the 
user  said  is  displayed  on  the  screen.  BBN’s  Air  Travel  Information  System  demonstrated  a 
similar  functionality  for  English  questions  about  flight  schedules  and  other  travel  information. 
The  point  was  made  that  tasks  like  these  can  serve  language  sustainment  by  providing  a 
simulated  world  in  which  the  learner  uses  the  target  language  to  solve  realistic  problems 
typical  of  SOF  missions. 

The  remaining  CSR-based  systems  were  developed  specifically  for  language 
instruction,  including  the  Voice  Interactive  Language  Training  System  (VILTS)  of  SRI,  the 
LISTEN  tutor  from  Mostow  at  CMU,  and  the  demonstrations  by  Bernstein  from  Entropic 
Research  Laboratory  as  well  as  by  Lynch  and  Carlson  from  Lincoln  Laboratory.  VILTS 
showed the precision of CSR technology for modeling learners’ pronunciation and for
diagnosing  departures  from  native  pronunciation  in  French.  The  system  also  showed  how 
databases  developed  for  speech  recognition  can  be  further  exploited  for  listening 
comprehension,  where  learners  can  request  to  hear  a  particular  word  or  idiom  pronounced  by 
different  speakers  in  different  utterance  contexts.  Mostow’s  LISTEN,  developed  to  teach 
beginning  readers  of  English,  detects  the  words  readers  have  trouble  with  and  coaches  them 
on  the  fly  with  hints  and  corrections  as  misreadings  occur.  Demonstrating  the  flexibility  of 
the  CSR  approach,  LISTEN  generalizes  to  new  texts  without  specific  new  training.  SOF 
representatives  viewing  this  demonstration  suggested  an  immediate  use  for  a  foreign  language 
LISTEN  to  coach  personnel  tasked  with  briefing  foreign  nationals  in  the  native  language. 
Bernstein  demonstrated  CSR  programs  for  automatically  assessing  spoken  language  fluency  as 
well  as  for  communicative  language  instruction,  in  which  learners  describe  a  picture  or  direct 
an animated event in Spanish. Lincoln Laboratory demonstrated a lesson based on ARI’s
Military  Language  Tutor  (MILT)  in  which  the  learner  poses  questions  in  Spanish  to  a 
modeled  person  who  responds  with  prerecorded  utterances  in  Spanish.  The  applications  of 
both Lincoln Laboratory and Bernstein employ the HTK continuous speech recognizer
marketed by Entropic, the highest performer in terms of accuracy rates in a sequence of
ARPA competitions.

The  discrete  recognition  systems  of  Syracuse  Language  Systems,  Interactive  Drama, 
and  the  USMA  all  run  on  conventional  PC  platforms  (486  machines).  They  are  intended  as 
speaker  independent  (that  is,  individual  users  do  not  have  to  train  the  machine  on  their 
voices).  The  continuous  recognizers,  by  contrast,  run  on  workstations  such  as  the  Sparc,  but 
some of these recognizers are being ported down. For example, the SPHINX continuous
recognizer  from  CMU  has  been  ported  to  a  Pentium-based  laptop  running  under  Windows 
NT,  as  demonstrated  by  Mostow  for  the  reading  coach  LISTEN.  The  HTK  engine  marketed 
by  Entropic  is  being  ported  to  a  486  PC  running  under  Windows  (scheduled  for  the  end  of 
1995).  This  product  includes  a  development  kit  that  can  be  used  to  build  new  CSR 
applications.  While  designed  as  speaker  independent,  many  of  these  recognizers  perform 
better  after  a  short  period  of  adaptation  to  the  individual  speaker. 

Second  Day  Focus:  Speech  and  Text  Translation 

Dr.  Susann  Luperfoy  from  MITRE  overviewed  the  task  of  machine  translation  and 
what  makes  it  hard.  She  analyzed  the  multiple  aspects  of  language  and  communication  that  a 
computer  program  must  consider  in  order  to  produce  accurate  translations  (p.  C-1). 

Five  system  developers  or  groups  then  reviewed  their  translation  systems.  The  systems 
were  chosen  to  sample  a  range  of  approaches,  from  high-end,  long-term  solutions  to  low-end, 
short-term  solutions.  Two  high-end  systems  addressed  bidirectional,  speech-to-speech 
translation  of  dialogues  between  speakers  of  different  languages.  These  systems  represent 
attempts  to  incorporate  all  the  aspects  of  language  and  discourse  described  by  Luperfoy: 
Waibel  from  CMU  showed  the  JANUS  system  for  translating  between  multiple  language 
pairs,  permitting  any  combination  of  English,  German,  or  Spanish  input  (Korean  and  Japanese 
are under development), with English, German, Spanish, Korean, or Japanese output
(p. C-27). Language Systems Inc. showed the machine-aided voice translator (MAVT),
sponsored  by  Rome  Laboratory  and  designed  to  translate  between  English  and  Spanish,  with 
extensions  underway  to  Arabic  and  Russian  (p.  C-48).  Both  systems  incorporate  an 
interlingual  approach,  in  which  the  source  language  is  translated  into  an  abstract,  universal 
semantic  representation  (an  interlingua)  before  being  converted  to  the  target  language.  The 
interlingua  provides  maximum  generalizability  to  new  language  pairs.  In  addition,  both 
systems  make  the  translation  problem  tractable  by  focusing  on  a  single  domain:  meeting 
scheduling  (Janus)  and  basic  tactical  interrogation  (MAVT).  Notably,  Janus  was  designed  to 
handle  the  disfluencies  common  in  spontaneous  speech  (pauses,  re-starts,  and  fillers  like 
"um"). It collects large samples of real conversations around the target domain and then
models  the  observed  disfluencies  so  they  can  be  systematically  separated  out  when  new 
conversations  are  processed.  By  training  on  large  samples,  Janus  permits  recognition  and 
translation  of  new  utterances  that  have  not  been  specifically  predicted. 
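
The idea of separating disfluencies from the content of an utterance can be illustrated with a deliberately simple filter. This is not how Janus works: the report says Janus models disfluencies observed in large conversational samples, whereas the sketch below substitutes a hand-written filler list and a repeated-word rule. The filler inventory, function name, and sample sentence are all assumptions.

```python
# Assumed filler inventory; a data-driven system would learn these from samples.
FILLERS = {"um", "uh", "er", "hmm"}


def strip_disfluencies(utterance: str) -> str:
    """Drop filler words and immediate word repetitions (simple re-starts)."""
    cleaned = []
    for token in utterance.lower().split():
        word = token.strip(",.?!")
        if not word or word in FILLERS:
            continue  # skip fillers and bare punctuation
        if cleaned and cleaned[-1] == word:
            continue  # skip an immediately repeated word
        cleaned.append(word)
    return " ".join(cleaned)


if __name__ == "__main__":
    print(strip_disfluencies("Um, can we we meet, uh, on Tuesday?"))
```

A rule like this also removes legitimate repetitions, which is one reason a model trained on real conversations is preferable for spontaneous speech.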

Lincoln  Laboratory  demonstrated  a  bi-directional  Korean-English  translator,  CCLINC, 
that  works  on  text,  thus  eliminating  the  problem  of  speech  recognition  (p.  C-56).  This 
translator focuses on the domain of Naval operations messages and uses an interlingua for
extendibility to new reports (p. C-57). These three high-end systems - Janus, MAVT, and
CCLINC - currently run on workstations rather than PCs.

Two  quick-term  approaches  to  translation  were  also  demonstrated.  The  FALCON 
(Forward Area Language Converter) uses a bilingual word list to perform word-for-word
translation  of  a  scanned-in  foreign  language  document  (p.  C-63).  Although  the  resulting 
English  text  is  low  on  conventional  measures  of  accuracy  and  readability,  it  usually  gives 
enough  information  for  the  English-speaking  soldier  in  the  field  to  decide  whether  to  forward 
the document to headquarters for full translation. The Army Materiel Command and the
Army Research Laboratory are developing FALCON for the XVIII Airborne Corps.
Currently available for French, it is being extended to other languages.
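
Word-for-word translation from a bilingual word list can be sketched in a few lines. The example is a hypothetical illustration rather than FALCON's actual implementation: the word list, the bracket convention for unknown words, and the function name are assumptions, and scanning and character recognition are left out entirely.

```python
import re

# Tiny hypothetical French-English word list, for illustration only.
WORD_LIST = {
    "le": "the", "la": "the", "pont": "bridge", "route": "road",
    "est": "is", "ferme": "closed", "ouverte": "open",
}


def gloss(text: str, word_list: dict) -> str:
    """Replace each word with its listed equivalent; bracket unknown words."""
    # [^\W\d_]+ matches runs of letters, including accented ones.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(word_list.get(tok, f"[{tok}]") for tok in tokens)


if __name__ == "__main__":
    print(gloss("Le pont est ferme.", WORD_LIST))
    print(gloss("La route est barree.", WORD_LIST))
```

Leaving unknown words visible in brackets fits the screening use described above, since the reader can see how much of the document the word list failed to cover.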

The  Multimedia  Medical  Translator,  demonstrated  by  HMC(AW)  Hesslink,  is  a  suite 
of  nearly  2,000  prerecorded  utterances  in  more  than  40  languages,  available  on  a  CD-ROM 
disk  for  use  in  medical  examinations  (p.  C-74).  The  user  accesses  the  desired  recordings  by 
choosing  from  menus  of  English  questions  and  expressions.  The  corresponding  foreign 
language  utterances  are  then  played  by  the  device.  Questions  are  designed  to  elicit  yes-no 
answers  or  pointing  responses.  Developed  by  the  Naval  Aerospace  and  Operational  Medical 
Institute, this program is being used by Naval health care staff supporting U.N. operations in
the  former  Yugoslavia.  The  program  was  recently  extended  to  training  in  mine  clearing 
operations.  Both  the  Multimedia  Medical  Translator  and  FALCON  run  on  a  PC,  laptop,  or 
notebook  equivalent. 
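
The menu-driven access to prerecorded utterances amounts to a lookup keyed on the English prompt and the target language. The sketch below is a minimal illustration under assumed names: the catalogue contents, file names, and function names are hypothetical and are not taken from the Multimedia Medical Translator.

```python
from typing import Dict, List, Tuple

# Hypothetical catalogue: (English prompt, language) -> recording file name.
RECORDINGS: Dict[Tuple[str, str], str] = {
    ("Does it hurt here?", "Spanish"): "es_0001.wav",
    ("Does it hurt here?", "French"): "fr_0001.wav",
    ("Point to where it hurts.", "Spanish"): "es_0002.wav",
}


def menu_for(language: str) -> List[str]:
    """List the English prompts that have a recording in the given language."""
    return sorted(prompt for (prompt, lang) in RECORDINGS if lang == language)


def recording_for(prompt: str, language: str) -> str:
    """Return the recording to play for a menu choice."""
    try:
        return RECORDINGS[(prompt, language)]
    except KeyError:
        raise LookupError(f"no {language} recording for: {prompt!r}") from None


if __name__ == "__main__":
    for prompt in menu_for("Spanish"):
        print(prompt, "->", recording_for(prompt, "Spanish"))
```

Because every output is a fixed recording, no recognition or translation of the patient's reply is needed, which is why the questions are written to elicit yes-no or pointing answers.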

Systems  for  translation  were  included  in  the  review,  first,  because  SOCOM  has  a 
documented  requirement  for  translation,  both  text-  and  speech-based;  second,  because  many 
of  the  components  developed  for  translation  can  also  support  language  training  and 
sustainment.  Cooperative  agreements  to  share  technologies  already  exist  between  ARI  and 
the  various  agencies  that  support  translation  work. 

Conclusions 

Government  participants  in  the  review  included  scientists  as  well  as  end  users 
representing  SOF,  the  Army  Research  Institute,  ARPA,  the  Army  Intelligence  Center  and 
School,  the  Defense  Language  Institute,  the  Deputy  Chief  of  Staff  for  Intelligence  (HQDA), 
Deputy Chief of Staff for Operations (HQDA), the Army Research Laboratory, Army
Training  and  Doctrine  Command,  Army  Research  Office,  CIA,  NSA,  DCI  Foreign  Language 
Committee,  and  Rome  Laboratory  (Air  Force),  among  other  agencies  (Appendix  E). 
Government  representatives  generally  agreed  that  the  core  technologies  demonstrated  at  the 
review  -  discrete  and  continuous  speech  recognition  -  were  sufficiently  mature  to  support  a 
robust  language  sustainment  tutor  with  which  learners  can  interact  by  speaking.  Moreover,  it 
was  agreed  that  these  technologies  appear  suitable  for  both  pronunciation  training  and  practice 
of  conversational,  communicative  tasks  in  target  languages.  Both  commercial  and  research 
demonstrations  were  credible  in  that  most  permitted  new  and  unpracticed  users  to  interact 
with  the  system  without  significant  performance  deficits. 

At  the  same  time,  it  was  agreed  that  applied  research  and  development  are  needed  to 
shape the core technologies into a product useful to SOF. Commercially available software,
while useful for global language training, does not address SOF-specific tasks and vocabulary,
nor  is  it  available  in  the  more  difficult  languages  critical  to  SOF  (e.g.,  Arabic,  Korean,  Thai). 
Moreover,  commercial  language  learning  products  currently  use  discrete  recognition 
algorithms  and  do  not  exploit  the  power  of  CSR  to  process  spontaneous,  variable  utterances. 
Similarly,  research  prototypes,  many  of  which  do  employ  CSR  to  train  language  learning 
skills,  are  not  available  in  high-priority  languages,  nor  do  they  address  task  domains  of 
concern  to  SOF.  Plans  were  made,  then,  to  develop  a  short-term  (1-year)  language 
sustainment  tutor  using  discrete  speech  recognition  and  a  medium-term  (2-year)  tutor  using 
continuous  speech  recognition,  both  addressing  SOF-critical  languages  and  tasks.  Beginning 
in  FY96,  this  development  is  to  be  supported  by  a  joint  program  involving  SOCOM,  ARPA, 
and  ARI,  working  through  the  SOF  Language  Office  and  guided  by  specific  input  from  the 
SOF  Groups,  NAVSOC,  and  AFSOC. 
TECHNOLOGY REVIEW: SPECIAL OPERATIONS FORCES (SOF)
SPEECH  RECOGNITION  FOR  LANGUAGE  SUSTAINMENT 

AGENDA 


Wednesday - 2 August 1995

0730  Registration  Opens  -  Continental  Breakfast 

0830  Introduction  -  Melissa  Holland  (ARI) 

Gil  Buhrmann  (Office  of  Special  Technology) 

Allen  Sears  (ARPA  Human  Language  Systems 
and  Human  Computer  Interactions) 

Mike  Sanders  (ARI,  Ft.  Bragg) 

0850  SOF  Language  Training  and  Sustainment 

Overview 

LTC  Kjoss,  SOF  Language  Office  (Interservice) 

School  Perspective:  Initial  Language  Training 

LTC  Williams  (JFK  Special  Warfare  Center  and  School) 
Groups  Perspective:  Language  Sustainment 
LTC  Brady  (US  Army  SF  Command) 

Questions  for  SOF 

1000 Break

1015 Speech Recognition (SR) State-of-the-Art

Cliff Weinstein (Lincoln Lab)

1045 Introduction to the Systems: SR for Language

Training/Sustainment  -  Set  1  and  Set  2  Systems 

1230  Lunch 



1330  Demonstrations  of  Set  1  Systems 
1510  Break 

1525 Demonstrations of Set 2 Systems

1710  Summary  and  Announcements  -  Melissa  Holland  (ARI) 

Mazie  Knerr  (HumRRO) 

1730  Reception  with  Cash  Bar 
1900  Dinner 


Thursday  -  3  August  1995 

0730  Continental  Breakfast  (General  Meeting  Room) 

0830  Introduction  -  Melissa  Holland  (ARI) 

0835  Speech  Translation:  Problems  and  Prospects  - 
Susann  Luperfoy  (MITRE) 

0900  Introduction  to  the  Systems:  Translation  and  Speech 
Recognition  -  Set  3  Systems 

0945  Break 

1000 Demonstrations: Speech Translation Systems - Set 3 Systems
1140  Discussion  and  Summary  -  Robert  J.  Seidel  (ARI) 

1200  Adjourn  general  meeting 

Demos  from  August  3  are  available  until  1245 

Notes:  Meetings  on  August  3 

• ARPA developers meet with Allen Sears from 0700 - 0830
(Palais  Room) 

•  Government  meeting  with  SOF  representatives  from  1330  -  1530 
(General  Meeting  Room) 
Technology Review:

Special  Operations  Forces  (SOF) 

Speech  Recognition  for  Language  Sustainment 

Dr.  Clifford  Weinstein’s  Presentation 
“Spoken Language Technology and Applications: State-of-the-Art”
PHONE: 617-981-7491; FAX: 617-981-0186
State of the Art: Example in the ATIS Domain
- USER STRESS
ENVIRONMENT-RELATED
- ACOUSTIC BACKGROUND
- CHANNEL AND MICROPHONE QUALITY
simulation and training


HLS  Technology  Transfer  Strategy 
HUMAN-MACHINE INTERACTIONS IN 2020

C-STAR II

EXAMPLE OF SENTENCE INTERPRETATION

Interactive decision support using dialog
Impact:  Improved  military  readiness,  affordability,  and  usability 
Roger: Logistics and Transportation Anchors are set-up
.. 2-way video will cost the standard rate ..
.. there will be a 2 minute wait for the warehouse info.

Planning / Understanding


SPOKEN  LANGUAGE  TECHNOLOGY  AND 
APPLICATIONS:  STATE-OF-THE-ART 
SPANISH-SPEAKING OPERATOR
Technology Review:

Special  Operations  Forces  (SOF) 

Speech  Recognition  for  Language  Sustainment 


Descriptions  of  Speech  Recognition  Systems 


TriplePlay  Plus! 
Dr.  Martin  Rothenberg 


TriplePlay Plus!, from Syracuse Language Systems, is a fun and effective way to learn
to read, speak, and understand a foreign language. The unique Speech Recognition mode in
TriplePlay Plus! brings language learning closer to the natural way a person learns a first language
-  by  spoken  interaction. 

TriplePlay Plus! features Speech Recognition technology licensed from Dragon Systems,
Inc.,  that  evaluates  the  learner’s  pronunciation.  Speech  Recognition  is  embedded  in  interactive 
games  and  conversations  that  provide  an  engaging  multimedia-immersion  approach  to  language 
learning. 

TriplePlay  Plus!  includes  a  high-quality  dynamic  microphone  for  use  with  the  Speech 
Recognition and Record/Playback features. The Windows CD-ROM is co-published by Syracuse
Language Systems, Inc., and Random House, Inc. as part of the Living Language Multimedia™
product line.

Designed  for  learners  age  8  to  adult,  TriplePlay  Plus!  teaches  over  1,000  words  and 
phrases in versions for learning Spanish, French, German, English or Hebrew. The product uses
multimedia  language  immersion,  a  learning  method  developed  at  Syracuse  University,  to  teach 
naturally,  entirely  in  the  language  to  be  learned. 

TriplePlay  Plus!  is  the  winner  of  several  industry  awards,  including  a  1995  HOME  PC 
Editor’s Choice Award, a 1994-1995 Technology & Learning Award of Excellence, and a 1994
NewMedia INVISION Award for innovation in multimedia.

Contact: Dr. Martin Rothenberg
Syracuse Language Systems, Inc.
719 E. Genesee St.
Syracuse, NY 13210
(315) 478-6729/(800) 688-1937; FAX: (315) 478-6902
Conversim™ - A Dialog with a Native Speaker in a Multimedia Environment

Dr.  William  G.  Harless 


Through  the  creative  application  of  interactive  video  and  speech  recognition  technologies, 
Interactive Drama’s Conversim software offers a unique approach to foreign language training:
Students  learn  to  speak  the  language  through  face-to-face  dialogue  with  native  speakers  in 
simulated  real-life  situations. 

Two  simulations  will  be  presented:  "Medical  Spanish"  and  "Roberto’s  Restaurant."  The 
simulated  character  in  the  medical  Spanish  program  is  an  elderly  real  patient  with  a  history  of 
heart  trouble.  The  simulated  character  in  the  restaurant  program  is  actually  the  charismatic  owner 
of  the  restaurant.  Each  simulation  involves  a  situation  which  requires  that  students  master  words 
and  phrases  in  order  to  manage  the  real-life  situation.  Assisted  by  an  on-screen  native  instructor, 
students  first  learn  and  rehearse  the  vocabulary,  then  they  practice  using  this  vocabulary  in  a 
direct  dialogue  with  the  simulated  character. 

Contact:  Dr.  William  G.  Harless 

Interactive  Drama,  Inc. 

7900  Wisconsin  Avenue,  Suite  200 
Bethesda,  MD  20814 
(301)  654-0676;  FAX:  (301)  657-9174 
e-mail: INTDRAMA@aol.com
The Here and Now in Voice-Interactive Language Learning Systems
LTC  Steve  LaRocca  and  COL  Woody  Held 


In  developing  voice-interactive  systems  for  foreign  language  study  at  West  Point,  speech 
recognition  was  added  as  an  enhancement  to  interactive  video  platforms.  The  idea  was  to  make 
existing  language  lessons  "talkies"  by  using  speech  recognition  in  lieu  of  a  keyboard  or  mouse 
to  respond  to  multiple  choice  questions.  The  speech  recognition  technology  used  is  inexpensive 
and  relatively  simple.  The  recognizer  is  used  to  differentiate  between  a  small  number  of 
complete  utterances,  trained  specifically  for  each  lesson.  This  system  adds  vocabulary 
development  to  the  work  of  authoring  lessons,  yet  provides  students  with  courseware  that  uses 
all four language skills (listening, reading, writing and speaking) and more realism as well.
Voice-interactive  systems  at  West  Point  capitalize  on  the  low  cost  ($150)  of  Prometheus  Aria 
16SE  sound  cards  and  the  easy-to-use  Aria  Listener  software.  We  are  working  with  Duke 
University  to  bring  Aria-type  speech  recognition  into  the  WinCALIS  authoring  system. 
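
Choosing among a small set of complete utterances can be illustrated as a closest-match problem. The sketch below is an assumption-laden stand-in: it compares text strings with Python's difflib, whereas the system described above matches the student's speech against utterances trained for each lesson. The function name, cutoff value, and sample answers are hypothetical.

```python
import difflib
from typing import Optional, Sequence


def pick_choice(heard: str, choices: Sequence[str], cutoff: float = 0.6) -> Optional[int]:
    """Return the index of the answer choice closest to `heard`, or None."""
    normalized = [choice.lower() for choice in choices]
    matches = difflib.get_close_matches(heard.lower(), normalized, n=1, cutoff=cutoff)
    if not matches:
        return None  # nothing is similar enough; ask the student to repeat
    return normalized.index(matches[0])


if __name__ == "__main__":
    answers = ["la casa es grande", "el perro es negro", "la mesa es roja"]
    print(pick_choice("el pero es negro", answers))
```

Keeping the candidate set small, as the lessons do, is what lets an inexpensive recognizer make this decision reliably.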


Contact: LTC Steve LaRocca
Center for Technology Enhanced Language Learning
Department of Foreign Languages
U.S. Military Academy
West Point, NY 10996
(914) 938-5286; FAX: (914) 938-3585
e-mail: gs0416@usma3.usma.edu
Speech and Language Technology

Dr.  Madeleine  Bates  and  Mr.  Sean  Colbath 


We  will  demonstrate  or  show  on  videotape  a  number  of  systems  that  illustrate  the  state 
of  the  art  in  speech  recognition  and  language  understanding: 

1. ATIS - an air travel information system that understands spoken questions and
commands.

2.  Large  vocabulary  (20,000  words),  real-time,  continuous,  speaker-independent  speech 
recognition. 

3.  Form  filling  via  speech. 

4. Speaker identification - identifies which speaker from a known set of possible speakers
is talking; very rapid enrollment process; works in any language.

5.  VALAD  -  a  system  that  integrates  speech  with  mouse,  menus,  and  keyboard, 
interfacing  to  the  logistics  anchor  desk  and  intended  for  use  by  military  logistical  planners.  The 
resulting  interactive  spoken  language  understanding  system  was  recently  demonstrated  at  Prairie 
Warrior  ’95. 

Contact:  Dr.  Madeleine  Bates 

BBN  Systems  and  Technologies 
70  Fawcett  Street 
Cambridge,  MA  02138 
(617)  873-3634;  FAX:  (617)  547-8918 
e-mail:  Bates@BBN.com 
Language Tutor and Bilingual Voyager System

Dr.  Victor  W.  Zue,  Dr.  Joseph  Polifroni,  and  Dr.  Stephanie  Seneff 
Spoken  Language  Systems  Group 


The Spoken Language Systems Group will demonstrate two related systems:

1. A "Language Tutor" applied to Japanese, which provides users with practice drills
and  feedback  to  help  them  recall  and  pronounce  Japanese  words  and  phrases  that  will  be  of  use 
in  the  second  demo. 

2.  The  "Bilingual  Voyager"  system,  which  gives  the  user  information  appropriate  for 
a  traveler  in  Cambridge,  Massachusetts  (hotels,  restaurants,  banks,  etc.)  and  locates  places  of 
interest  on  the  map.  The  user  can  converse  with  the  system  in  English,  Japanese,  or  "mixed 
mode"  (e.g.,  user  speaks  in  English,  system  responds  in  Japanese). 

Both  systems  use  a  continuous-speech,  speaker-independent  speech  recognizer.  The 
acoustic  models  were  trained  on  both  read  and  spontaneous  speech  from  native  speakers  in  each 
language.  The  systems  run  on  a  Sun  Sparc  20  workstation. 

Contact:  Dr.  Stephanie  Seneff 

Spoken Language Systems Group
Massachusetts Institute of Technology
Cambridge, MA 02139
(617) 253-0451; FAX: (617) 258-8642
e-mail:  seneff@lcs.mit.edu 
Research and Development of
Multilingual  Conversational  Systems 


Spoken  Language  Systems  Group 
Laboratory  for  Computer  Science 
Massachusetts  Institute  of  Technology 

August 2, 1995


What  Is  a  Conversational  System? 


•  It  not  only  recognizes,  but  also  understands 
verbal  input,  in  order  to  perform  some  tasks 
beyond  dictation  (e.g.,  database  access) 

•  Speech  recognition  technology  must  be 
augmented  with  language  understanding 
technology  (including  syntax,  semantics, 
discourse,  and  dialogue) 

•  The  system  may  have  to  respond  using 
natural  language  (including  spoken  output) 


Conversational System Architecture


GALAXY
(Multilingual, 1994)
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Current  Status  at  MIT 


ConversstionsI  systems  ere  emerging  thet  cen: 

•  Deal  with  continuous  speech,  by  unknown  users, 
drawn  from  a  large  vocabulary, 

•  Understand  the  meaning  of  the  utterances  and  take 
appropriate  actions, 

•  Operate in real (or realistic) domains,

•  Handle  multiple  languages  (English,  Japanese, 
Spanish,  French,  Italian,  German,  Chinese),  and 

•  Deliver these capabilities in real time, using
standard  workstations  with  no  additional  hardware 




Multilingual  Conversational  Systems 
for  Human-Computer  Interactions 


Semantic Frame Representation


Understand 


Clause: LOCATE

Topic:  PUBLIC-BUILDING 
Quantifier:  DEF 
Name:  library 

Predicate:  NEAR 

Topic:  SQUARE 
Name:  Central 


Paraphrase 


WHERE  IS  THE  LIBRARY  NEAR  CENTRAL  SQUARE 

SENTORARU  SUKUEA  NO  CHIKAKU  NO  TOSHOKAN  WA 
DOKO  DESU  KA 

DOVE  STA  LA  BIBLIOTECA  VICINO  A  CENTRAL 
SQUARE 

OU  SE  TROUVE  LA  BIBLIOTHEQUE  QUI  EST  PRES  DE 
CENTRAL  SQUARE 
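
The frame above can be pictured as a small nested data structure from which a paraphrase is generated for each language. The following Python sketch is illustrative only: the `Frame` class, its field names, the nesting of the NEAR predicate under the topic, and the `paraphrase_english` function are assumptions made for this example, not the MIT system's actual representation or generation rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Frame:
    """Hypothetical nested semantic frame, mirroring the fields on the slide."""
    kind: str                            # e.g. "LOCATE", "PUBLIC-BUILDING", "NEAR"
    name: Optional[str] = None           # e.g. "library", "Central"
    quantifier: Optional[str] = None     # e.g. "DEF" for a definite reference
    topic: Optional["Frame"] = None      # nested topic frame
    predicate: Optional["Frame"] = None  # nested predicate frame


def noun_phrase(topic: Frame) -> str:
    """Render a topic frame as an English noun phrase."""
    if topic.kind == "SQUARE":
        words = [topic.name, "SQUARE"]
    else:
        words = [topic.name]
    if topic.quantifier == "DEF":
        words.insert(0, "THE")
    if topic.predicate is not None and topic.predicate.kind == "NEAR":
        words += ["NEAR", noun_phrase(topic.predicate.topic)]
    return " ".join(w.upper() for w in words)


def paraphrase_english(clause: Frame) -> str:
    """Generate an English paraphrase for a LOCATE clause."""
    if clause.kind != "LOCATE":
        raise ValueError("only LOCATE clauses are handled in this sketch")
    return "WHERE IS " + noun_phrase(clause.topic)


frame = Frame(
    kind="LOCATE",
    topic=Frame(
        kind="PUBLIC-BUILDING", quantifier="DEF", name="library",
        predicate=Frame(kind="NEAR", topic=Frame(kind="SQUARE", name="Central")),
    ),
)
print(paraphrase_english(frame))
```

Because the frame itself carries no language-specific wording, a second generation function for Japanese, Italian, or French could in principle read the same structure.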

Multilingual Conversational System

[Figure: system diagram. Legible labels: Models, SPEECH RECOGNITION, LANGUAGE UNDERSTANDING, Tables]

The MIT VOYAGER System


•  VOYAGER is a conversational system that
can  provide: 

-  Navigation  assistance  within  a  region  of 
Cambridge,  MA,  and 

-  Information  about  some  locations  within  this 
region, such as hotels, banks, libraries, etc.

•  The  system  can  accept  continuous  speech 
input  from  any  user 

•  It produces output in the form of graphics,
text  and  synthetic  speech 

•  It  converses  in  English,  Japanese,  and 
Italian 




Language  Tutor:  An  Interactive 
Spoken  Language  Learning  Aid 


•  The  system  provides  a  non-threatening,  interactive 
environment  to  help  people  acquire  language 
skills 

•  A  speech  understanding  system  shadows  the  user 
and  provides  feedback  on  pronunciation  skills 

•  It  is  currently  operating  for  English  and  Japanese 


A Novel Approach to Language Learning


•  Dovetails a language tutor with a multilingual
conversational system such as VOYAGER

•  Each  lesson  would  consist  of: 

-  Newly  introduced  vocabulary  and  grammar  drills 

-  A  scenario  specifically  designed  for  the  lesson 

•  Students  can  speak  in  their  native  language  and  hear 
responses  in  target  language,  or  vice  versa, 
providing flexible alternatives for practicing
speaking/listening 

•  Enables students to practice interaction in a risk-free
setting 

-  Goes  beyond  mechanics  of  standard  reading/speaking 
exercises. 

-  Simulates  real  world  in  a  language  laboratory. 


Voice Interactive Language Training System (VILTS)

Patti  Price,  Marikka  Rypa,  Leo  Neumeyer,  and  George  Chen 
Research  and  Technology  Laboratory 
SRI  International 

Mike  Valatka  and  Kathleen  Egan 
Office  of  Research  and  Development,  CIA 

Helena Hughes
Federal Language Training Laboratory, CIA

Jacqueline Pogany
Office of Training and Education, CIA


1.0  Overview 

The Voice Interactive Language Training System (VILTS) is language education software
being  developed  to  foster  improvement  in  French  comprehension  and  speaking  skills.  VILTS 
represents  a  joint  development  effort  between  SRI  International,  the  Office  of  Training  and 
Education  (OTE),  and  the  Federal  Language  Training  Laboratory  (FLTL).  The  focus  of  the 
program  is  to  train  students  at  levels  1  through  3  in  comprehension  and  discrimination  skills  and 
subsequently  in  speaking  and  pronunciation  skills  through  a  series  of  activities  centered  around 
listening,  speaking,  and  reading.  SRI  is  incorporating  advances  in  its  research  in  speech 
recognition  and  pronunciation  evaluation  to  provide  students  with  the  opportunity  to  navigate 
through  a  unit  using  oral  communication,  with  the  system  recognizing  appropriate  or 
inappropriate responses. At the end of a unit, the student will be given feedback as to how s/he
compares  to  a  native  speaker,  and  additional  feedback  on  specific  problematic  sounds. 
Pronunciation  exercises  will  be  provided  that  target  specific  problem  areas  tailored  to  specific 
student  needs. 

The  present  system  under  demonstration  uses  French  speech  recognition  capabilities;  the 
evaluation  capabilities  are  scheduled  to  be  included  in  early  1996. 

2.0  Speech  Recognition  and  Speech  Evaluation 

As  a  leader  in  speech  technology,  SRI  has  conducted  world-class  research  in  speech 
recognition,  pronunciation  evaluation,  and  speech  processing  capabilities  as  applied  to  language 
education.  SRI  has  consistently  scored  among  the  top  contenders  in  the  ARPA-sponsored  speech 
benchmark  competitions  in  the  last  10  years;  SRI’s  speaker-independent  technology  can  recognize 
natural, continuous speech without requiring the user to train on the system. The VILTS
represents  a  pioneering  effort  to  combine  the  power  and  robustness  of  state-of-the-art  speech 
recognition  with  pedagogically  engaging  learning  activities  and  feedback  on  individual 
pronunciation. 

2.1  Speech  Recognition  Activities 

The  student  interacts  with  the  system  orally  to  simulate  natural  conversation  by 
responding to questions or posing questions to the system.

As student speech is elicited through a variety of activities, the French speech recognizer
listens  for  the  oral  student  input  and  responds  appropriately,  either  accepting  or  rejecting  the 
response,  depending  on  a  threshold  level  of  acceptance.  The  extent  to  which  the  student  or 
instructor  can  determine  this  level  of  acceptance  is  an  area  of  future  investigation. 
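
The accept/reject decision described above can be sketched in a few lines of Python. This is an assumption-laden illustration, not the VILTS implementation: it supposes the recognizer returns a normalized score between 0 and 1, and the function name `judge_response` and the default threshold are hypothetical.

```python
def judge_response(score: float, threshold: float = 0.5) -> str:
    """Return "accept" when the recognizer score meets the threshold.

    Both `score` and `threshold` are assumed to be normalized to [0, 1].
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must lie in [0, 1]")
    return "accept" if score >= threshold else "reject"


# The same utterance can pass a lenient setting and fail a strict one,
# which is why control over the threshold matters to students and instructors.
for level in (0.4, 0.8):
    print(level, judge_response(0.65, threshold=level))
```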

2.2  Speech  Evaluation 

As  the  student  completes  a  unit  and  enough  speech  has  been  collected,  pronunciation 
evaluation  algorithms  will  be  employed  to  compare  the  student  performance  level  to  the 
pronunciation of a native speaker. Ratings from expert French instructors are being collected as
part of this development, and the machine ratings are intended to correlate with those of the expert raters. As a
result  of  evaluation  scores  and  subscores,  the  system  will  suggest  and  provide  exercises  to 
improve  a  student’s  problem  areas. 
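
One common way to check how well machine scores track expert ratings is a correlation coefficient. The sketch below computes the Pearson correlation in plain Python. The choice of Pearson correlation is an assumption for illustration (the text does not say which measure is used), and the scores are made-up values, not data from the project.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for five students (illustrative values only).
expert_ratings = [1.0, 2.0, 3.0, 4.0, 5.0]
machine_scores = [1.2, 1.9, 3.4, 3.8, 5.1]
print(round(pearson(machine_scores, expert_ratings), 3))
```

A value near 1 would indicate that the machine ranks students much as the expert raters do.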

3.0  Pedagogical  Architecture 

The  design  and  development  of  the  Voice  Interactive  Language  Training  System 
represents  a  collaboration  between  SRI  International,  the  Office  of  Training  and  Education,  and 
the  Federal  Language  Training  Laboratory.  The  units  and  activities  are  being  developed  by 
instructional  design  professionals  at  all  three  institutions;  FLTL  is  developing  the  graphics  which 
are  being  integrated  into  the  program  by  SRI. 

Using  spontaneous,  unscripted  French  conversations  on  various  topics  and  excerpts  from 
the French newspaper Le Monde, the VILTS program provides the student with authentic,
unrehearsed  French  speech  as  might  be  heard  in  everyday  speech  in  France.  The  conversations 
are  the  basis  for  the  activities,  which  focus  on  comprehension,  speech  production,  and 
pronunciation.  These  units  can  be  used  to  complement/supplement  a  course  for  students  learning 
French,  or  they  can  be  used  to  support  maintenance  training,  self-study,  and  refresher  programs. 

Conversations  on  ten  different  topics,  including  such  areas  as  travel,  health  care, 
education,  and  politics  were  collected  from  a  pool  of  100  native  speakers  of  French.  A  read 
version of these conversations was subsequently recorded by the same speakers so that both
spontaneous speech and a clearer and slower version are available to the student. Conversations
were collected to approximate three distinct levels of student ability: beginning, intermediate, and
advanced, corresponding roughly to government standard levels 1, 2, and 3. The student chooses
a  level  of  conversation  with  which  to  work,  and  then  chooses  from  a  menu  of  topics  available 
at  that  level.  Each  lesson  contains  activities  centering  on  listening,  speaking,  and  reading. 

Contact:  Dr.  Patti  Price 

SRI  International 
333  Ravenswood  Avenue 
Menlo  Park,  CA  94025 
(415) 859-5845; FAX: (415) 859-5984
e-mail: pprice@speech.sri.com

Project LISTEN:

A  Reading  Coach  That  Listens 

Dr. J. Mostow, Dr. M. Eskenazi, Dr. A. Hauptmann,
Dr. B. Milnes, and Dr. S. Roth

Project  LISTEN  is  developing  a  novel  weapon  against  illiteracy  —  an  automated 
reading  coach  that  displays  a  story  on  a  computer  screen,  listens  to  a  student  read  it  aloud, 
and  helps  where  needed.  The  coach  provides  a  combination  of  reading  and  listening,  in 
which  the  student  reads  wherever  possible,  and  the  coach  helps  wherever  necessary.  The 
coach  was  demonstrated  at  ARPA’s  1994  Human  Language  Technology  Workshop,  featured 
in  BYTE’s  cover  story  on  "7  New  Ways  to  Learn,"  and  honored  with  the  Outstanding  Paper 
Award  at  the  1994  National  Conference  on  Artificial  Intelligence. 

Problem:  Literacy  is  essential  to  economic  and  military  effectiveness  in  the 
Information  Age.  For  example,  both  industry  and  military  need  a  pool  of  recruits  who  can 
read  and  comprehend  manuals  for  high-tech  equipment.  Illiteracy  costs  the  United  States 
over $225 billion annually in corporate retraining, lost competitiveness, and industrial
accidents.  People  with  low  reading  proficiency  are  often  unemployed,  poor,  or  incarcerated. 
A  reading  coach  that  listens  could  give  millions  of  American  children  and  adults 
individualized reading assistance that teachers cannot provide.

Approach: Project LISTEN exploits an opportunity created by advances in speech
technology, reading, and human-computer interaction. The reading coach adapts Carnegie
Mellon's state-of-the-art Sphinx-II speech recognizer to analyze the student's oral reading.
The coach responds with assistance modelled after expert reading teachers. Successive
prototypes  have  been  tested  on  approximately  100  children  in  Pittsburgh  public  schools.  To 
go  from  prototype  to  practice,  the  coach  must  be  deployed  in  schools,  evaluated  in  actual  use, 
and  refined  into  a  practical  educational  tool. 
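
To make the idea of analyzing oral reading concrete, the following Python sketch aligns a recognizer's word output against the story text and lists the story words that were misread or skipped. It is a simplified illustration under stated assumptions: the recognizer is assumed to return a plain list of words, the alignment uses Python's standard `difflib` module rather than anything from Sphinx-II, and the function name `find_reading_errors` is hypothetical. It does not describe Project LISTEN's actual method.

```python
import difflib


def find_reading_errors(story_words, heard_words):
    """Return story words that have no match in the recognizer output."""
    story = [w.lower() for w in story_words]
    heard = [w.lower() for w in heard_words]
    matcher = difflib.SequenceMatcher(a=story, b=heard, autojunk=False)
    missed = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        # "replace" marks misread words; "delete" marks skipped words.
        if tag in ("replace", "delete"):
            missed.extend(story_words[i1:i2])
    return missed


story = "the cat sat on the mat".split()
heard = "the cat sit on mat".split()
print(find_reading_errors(story, heard))
```

A coach built on such an alignment could then decide where to offer help, for example by reading the flagged words aloud.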

Impact:  Project  LISTEN  offers  a  powerful  new  tool  to  combat  the  literacy  crisis  that 
threatens  the  nation’s  economic  and  military  security.  Second,  as  one  of  the  first  "stress  tests" 
of real-time continuous speech recognition in a real application, Project LISTEN provides
valuable  technical  lessons  about  how  to  make  spoken  communication  with  computers  usable 
and  robust.  Finally,  applications  to  defense  needs  include  more  cost-effective  reading 
instruction for the 95,000 children enrolled in Department of Defense Dependents Schools.
Spinoff  applications  include  individualized  foreign  language  training  for  Special  Forces 
personnel. 

Contact:  Dr.  Jack  Mostow,  Director 

Project  LISTEN 

Carnegie  Mellon  University  Robotics  Institute 
215  Cyert  Hall,  4910  Forbes  Avenue 
Pittsburgh,  PA  15213-3890 
(412)  268-1330;  FAX:  (412)  268-6298 
Internet: mostow@cs.cmu.edu

Entropic Speech Technology in Language Education
Dr.  Jared  Bernstein 


Entropic  Research  Laboratory  has  formed  a  Language  Systems  Group  to  develop 
algorithms  and  build  products  for  language  instruction  and  evaluation.  Entropic’s  existing  Unix 
software products provide the base technology for Interactive Spoken Language Education
(ISLE). Entropic offers systems and tools to support high-accuracy speech recognition for large
vocabularies, and for manipulation, storage and synthesis of high-quality speech. Entropic's core
products are advanced signal processing software and virtual instruments for the research and
development community. Over 400 R&D groups conduct their research and build products with
Entropic tools.

Fluency  Demonstration  System  (English):  Spoken  English  can  be 
aligned with corresponding text and used to automatically judge the speaker's fluency.

Picture Demonstration System (English/Spanish): An example of robust, tolerant speech
recognition  in  a  multiple  choice  exercise. 

Animation  Demonstration  System  (English/Spanish):  An  example  of  interaction  in 
Spanish  or  English  to  control  animated  events. 

Entropic  Time  Scale  Modification  (language  independent):  Software  that  slows  down  or 
speeds  up  recorded  speech  without  distortion. 

The  following  pages  describe  the  Entropic  program. 

Contact:  Dr.  Jared  Bernstein 

Language  Systems  Group 
Entropic Research Laboratory, Inc.
1040  Noel  Drive 
Menlo  Park,  CA  94025 
(415)  328-8877;  FAX:  (415)  328-8866 
e-mail: jared@entropic.com

Foreign Language Dialog System

Dr.  John  T.  Lynch  and  Dr.  Beth  Carlson 


The  FOREIGN  LANGUAGE  DIALOG  SYSTEM  is  a  speech  recognition-based 
tool for providing a novice language learner with authentic practice in speaking
and  listening  to  a  second  language.  The  tool  also  can  provide  a  convenient  way  to  maintain 
one’s  language  skills.  We  have  developed  a  proof-of-concept  demonstration  system  using 
UNIX-based  research  software  in  order  to  illustrate  the  potential  for  providing  an  environment 
where  a  learner  can  focus  on  the  immediate  communication  task  as  opposed  to  a 
memorization  or  drill  exercise.  The  DIALOG  SYSTEM  therefore  complements  foreign 
language instruction whether it involves machine or human interaction. The DIALOG
SYSTEM  would  ideally  be  integrated  with  other  instruction  so  that  the  vocabulary  and 
grammar  of  the  DIALOG  SYSTEM  would  match  the  requirements  of  the  learner  at  a 
particular stage of progress. In addition, the content of the DIALOG SYSTEM's scenarios
could be matched to the specific needs of the learner, e.g., food distribution, health care, or
combat  operations. 

The  DIALOG  SYSTEM  is  designed  with  the  following  three  principles  in  mind. 

1.  To  engage  the  learner  more  fully,  the  learner’s  speech  should  determine  the 
system  response. 

2.  To  be  realistic,  the  exact  wording  (vocabulary,  grammar)  should  be  open  and  not 
constrained  by  the  system. 

3.  To  improve  the  accuracy  of  the  speech  recognition  system  so  that  the  system  is 
useful, the intention and meaning of the learner's utterances should be constrained. This can be
done  by  context  defined  by  the  scenario. 

Our  system  addresses  these  principles  by  having  the  learner  address  verbal  questions 
to  a  person  represented  on  the  screen.  Our  present  system  uses  clip-art  images  but  future 
versions  would  use  photographs  or  motion  video  of  native  speakers  which  would  further 
enhance  the  immersion  experience. 

The  demonstration  system  is  based  on  a  security  interview  scenario.  To  help  guide  the 
learner,  the  system  provides  a  form  to  be  filled  out  for  the  subject  who  is  being  interviewed. 
This  form  would  specify  an  issue  such  as  "foreign  travel"  but  would  leave  unspecified  how 
the  learner  would  elicit  the  necessary  information  from  the  subject  being  interviewed.  That 
is,  the  system  would  respond  to  a  variety  of  wordings  (expected  of  the  learner  at  a  specified 
level of language achievement). To further aid the learning process, the system can also
provide  suggestions  on  how  to  formulate  each  question,  if  the  user  requests  such  information. 
Other  scenarios  are  easily  envisioned:  admission  to  a  hospital,  interrogation  of  a  suspected 
spy,  ordering  and  planning  distribution  of  food  supplies.  We  plan  to  provide  tools  so  that 
language  instructors  can  easily  develop  scenarios  matched  to  the  needs  of  their  training 
programs.

The current proof-of-concept system is implemented in three languages: English,
Spanish,  and  German. 

The  English  system  has  two  characters  to  be  interviewed  and  they  can  each  be  asked 
25  questions  in  a  variety  of  wordings.  For  example,  one  can  ask:  "Have  you  been  overseas 
recently?", or "Any overseas travel in the last 3 years?" The English speech recognition
system  was  trained  on  a  general  speech  corpus  called  TIMIT  which  consists  of  about  4  hours 
of  studio-quality  phonetically  rich  speech. 
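
The principle of open wording with constrained intention can be illustrated as a lookup from many accepted wordings to one intention. The Python sketch below is a hypothetical simplification: the `SCENARIO` table, the intention label `foreign_travel`, and the exact-match lookup are assumptions for this example. The actual system constrains recognition with a language grammar rather than matching finished text.

```python
import re

# Hypothetical scenario table: each intention lists the wordings it accepts.
SCENARIO = {
    "foreign_travel": [
        "have you been overseas recently",
        "any overseas travel in the last 3 years",
    ],
}


def normalize(utterance: str) -> str:
    """Lowercase and strip punctuation so wordings compare cleanly."""
    return re.sub(r"[^a-z0-9 ]", "", utterance.lower()).strip()


def intention_of(utterance: str):
    """Return the intention whose wording list contains the utterance, if any."""
    text = normalize(utterance)
    for intention, wordings in SCENARIO.items():
        if text in wordings:
            return intention
    return None


print(intention_of("Have you been overseas recently?"))
print(intention_of("Any overseas travel in the last 3 years?"))
```

Both wordings resolve to the same intention, so the on-screen character can give the same answer however the learner phrases the question.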

The  Spanish  system  has  two  characters  who  can  be  asked  five  questions  each  with  a 
number of variants per question type. The Spanish speech recognizer was trained on data
collected from 8 male and 8 female talkers who varied from native speakers to experienced
learners  to  novice  speakers.  The  German  system  has  one  character  who  can  be  asked  five 
questions  (with  two  wordings  each). 

The  German  recognizer  was  trained  on  data  collected  from  3  males  and  3  females  who 
are  novice  to  medium  experienced  speakers.  Our  long  term  plans  include  providing  tools  so 
that  language  instructors  could  port  the  system  to  new  languages  by  collecting  appropriate 
data  and  training  new  speech  models.  While  training  data  collection  is  not  always  desirable,  it 
is  often  necessary  for  less  common  languages  for  which  suitable  data  is  not  easily  obtained. 

The  system  demonstrated  can  run  in  real-time  on  both  a  SPARC  10  UNIX  workstation 
and  a  486/Pentium-based  personal  computer  running  the  LINUX  operating  system.  The 
speech  recognizer  software  is  based  on  HTK  (Hidden  Markov  Model  Toolkit),  which  is 
commercially available through Entropic, and uses a continuous speech recognition algorithm
with  a  language  grammar.  Modifications  were  made  to  the  recognition  algorithm  to  accept 
live  speech  input  and  to  interact  with  the  graphical  user  interface  (GUI).  The  GUI  is  based  on 
the MOTIF X-WINDOWS programming software. The current configuration of the system
uses  several  research  components  that  are  combined  through  the  use  of  data  pipes  and  shell 
scripts.  Future  general  system  design  improvements  are  needed  to  increase  system  and 
response speed and to improve the human-machine interface. In addition, further
enhancements to the actual speech recognizer include modeling the speech of talkers at
various  points  along  the  novice  to  native  continuum.  The  system  could  then  be  responsive  to 
the  level  of  a  particular  learner  and  at  the  same  time  provide  level-specific  pronunciation 
feedback  to  that  learner. 

Contact:  Dr.  John  T.  Lynch 

MIT  Lincoln  Laboratory 
244  Wood  St.  -  Rm  S4-177 
Lexington,  MA  02173-9108 
(617)  981-2746;  FAX:  (617)  981-0186 
e-mail: jtl@sst.ll.mit.edu

Technology Review:

Special Operations Forces (SOF)

Speech Recognition for Language Sustainment

Appendix C:
Speech and Text Translation Systems


Dr. Susann LuperFoy's Presentation
"Voice-to-Voice Machine Translation:
Problems and Prospects"

Discourse for Interactive
Speech-to-Speech Machine Translation

Potential Application of Verbmobil

The Three Dialogue Types

Multimodal HCI Dialogue


Address Elicitation in Samples


Closing indicated at discourse level

-  users needed a way to say “done” after each
input 

Half-duplex  transmission  was  too  slow,  even  with 
fastest  possible  intelligent  agent 


Error Recovery

the vision of those operational planners into a specification.
While  we  make  three  evaluations  during  operational  testing, 
the  last  two  could  have  been  made  years  earlier  if  the  system 
or  some  approximation  of  it  had  been  available.  In  the  sense 
that  two  of  the  three  evaluations  have  experienced  long  and 


Sample Discourse: Spoken Dialogue

Run-Time Protocol for Communication
Between Discourse and Speech

Real-time Japanese-English Dialogue
Translation

Discourse Utt-2

Distributed Simulation Interface

JANUS: Spontaneous Speech to Speech Translation Environment Technology


Dr.  Alex  Waibel 
Dr.  Arthur  E.  McNair 


The  JANUS  system  will  be  demonstrated  in  two  forms:  as  a  translating  videophone 
using  workstations,  and  as  a  portable  translation  unit  on  a  PC  laptop.  The  demonstrated 
domain for translation is a scheduling task (communication between two humans to agree on
a time/date to meet), though all technologies used are applicable to any domain, with effort
currently required only to retrain the recognizer and build grammars for a new task or
language  (any  overlap  in  tasks,  such  as  dates,  allows  direct  reuse  of  portions  of  grammars). 
The  technologies  demonstrated  in  JANUS  include  a  spontaneous  speech,  speaker  independent 
recognizer  which  can  be  trained  for  any  language  (currently  English,  German,  Spanish,  and 
Korean).  Also  used  is  a  text-to-text  translation  system  which  uses  hand-written  grammars  to 
parse  input  language  text,  and  then  generates  text  in  multiple  output  languages  (currently 
English,  German,  Spanish,  Korean,  and  Japanese).  Our  current  specialties  include 
spontaneous  speech  recognition,  multiple  parsing/generation  technologies  (including  automatic 
grammar  generation),  non-standard  modes  of  human  input  to  computers  (speech,  touch, 
handwriting,  visual),  and  the  combination  of  multiple  input  modalities  in  single  applications. 

Contact: Dr. Alex Waibel

School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
(412) 268-7676; FAX: (412) 268-5578
e-mail: ahw@cs.cmu.edu

Using Context in
Machine Translation of Spoken Language

Lori Levin†, Oren Glickman†, Yan Qu†, Donna Gates†,
Alon Lavie†, Carolyn P. Rose†, Carol Van Ess-Dykema*, Alex Waibel†
† Carnegie Mellon University (USA)
* U.S. Department of Defense
lori.levin@nl.cs.cmu.edu 

Abstract: We report on techniques for using discourse context to reduce
ambiguity and improve translation accuracy in a multi-lingual (Spanish,
German, and English) spoken language translation system. The techniques
involve statistical models as well as knowledge-based models including
discourse plan inference. This work is carried out in the context
of the Janus project at Carnegie Mellon University and the University of
Karlsruhe.

1  Introduction 

Machine  Translation  of  spoken  language  encounters  all  of  the  difficulties  of  written 
language  (such  as  ambiguity)  with  the  addition  of  problems  that  are  specific  to  spoken 
language  such  as  speech  disfluencies,  errors  introduced  during  speech  recognition,  and 
the lack of clearly marked sentence boundaries. Fortunately, however, we can take
advantage  of  the  structure  of  task-oriented  dialogs  to  help  reduce  these  difficulties. 

In  this  paper  we  report  on  techniques  for  using  discourse  context  to  reduce  ambiguity 
and  improve  translation  accuracy  in  a  multi-lingual  (Spanish,  German,  and  English) 
spoken  language  translation  system.  The  techniques  involve  statistical  models  as 
well  as  knowledge-based  models  including  discourse  plan  inference.  This  work  is 
carried  out  in  the  context  of  the  Janus  project  at  Carnegie  Mellon  University  and  the 
University of Karlsruhe ([1]).

There  has  been  much  recent  work  on  using  context  to  constrain  spoken  language 
processing.  Most  of  this  work  involves  making  predictions  about  possible  sequences 
of  utterances  and  using  these  predictions  to  limit  the  search  space  of  the  speech 
recognizer  or  some  other  component  (See  [2],  [3],  [4],  [5],  [6],  [7],  [8],  [9]).  The  goal 
of  such  an  approach  is  to  increase  the  accuracy  of  the  top  best  hypothesis  of  the 
speech  recognizer,  which  is  then  passed  on  to  the  language  processing  components  of 
the system. The underlying assumption being made is that design and complexity
considerations require that each component of the system pass on a single hypothesis
to the following stage, and that this can achieve sufficiently accurate translation
results.  However,  this  approach  forces  components  to  make  disambiguation  choices 
based  solely  on  the  level  of  knowledge  available  at  that  stage  of  processing.  Thus, 
components  of  the  system  further  down  the  line  cannot  correct  a  wrong  choice  of  an 
earlier  component. 

The work reported in this paper does not rely on predictions about subsequent
utterances (although we use such predictions in other work not reported here). The
key feature of our approach is to allow multiple hypotheses to be processed through
the system, and to use context to disambiguate between alternatives in the final stage
of the process, where knowledge can be exploited to the fullest. Since it is infeasible
to process all hypotheses produced by each of the system components, context is
also used locally to prune out unlikely alternatives. We describe four approaches
to disambiguation, two of which are sentence-based and two of which are discourse-based
in that they take a multi-sentence context into account. We show that the use
of discourse context improves performance on disambiguation tasks.

s1: que te parece el lunes              how do you feel about Monday?

s2: tal vez seria mejor en la tarde     the afternoon is perhaps better
    como a las a las dos de la tarde    around two p.m.

s1: no
    yo tengo toda la tarde ocupada      i am busy all afternoon
    de una a cuatro tengo una reunion   from one o'clock till four o'clock i have a meeting

s2: el lunes                            Monday
    entonces seria mejor el jueves      then Thursday is better

Figure 1: Example of Translation


2  System  Description 

Janus  is  a  speech-to-speech  translation  system  currently  dealing  with  dialogs  in  the 
scheduling  domain  (two  people  scheduling  a  meeting  with  each  other).  The  current 
source  languages  are  English,  German,  and  Spanish  and  the  target  languages  are 
English  and  German.  We  are  also  beginning  to  work  with  Korean,  Japanese,  and  other 
languages.  System  development  and  testing  is  based  on  a  collection  of  approximately 
400  scheduling  dialogs  in  each  of  the  source  languages.  Translation  of  a  portion  of  a 
transcribed  dialog  is  shown  in  Figure  1. 

The  main  modules  of  Janus  are  speech  recognition,  parsing,  discourse  processing, 
and  generation.  Each  module  is  designed  to  be  language-independent  in  the  sense 
that  it  consists  of  a  general  processor  that  applies  independently  specified  knowledge 
about  different  languages.  Therefore,  each  module  actually  consists  of  a  processor  and 
a  set  of  language-specific  knowledge  sources.  A  system  diagram  is  shown  in  Figure  2. 

Processing  starts  with  speech  input  in  the  source  language.  Recognition  of  the 
speech  signal  is  done  with  acoustic  modeling  methods,  constrained  by  a  language 
model.  The  output  of  speech  recognition  is  a  word  lattice.  We  prefer  working  with 
word  lattices  rather  than  the  more  common  approach  of  processing  N-best  lists  of 
hypotheses. An N-best list may be largely redundant and can be efficiently represented
in the form of a lattice. Using a lattice parser can thus reduce time and space
complexity  relative  to  parsing  a  corresponding  N-best  list.  Selection  of  the  correct 
path  through  the  lattice  is  accomplished  during  parsing  when  more  information  is 
available. 
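
The redundancy of an N-best list relative to a lattice can be seen in a toy example. The Python sketch below is illustrative only: the lattice contents, the adjacency-list encoding, and the function name `lattice_paths` are assumptions for this example and do not reflect the Janus data structures. It expands a small word lattice into the list of hypotheses it encodes and compares the sizes of the two representations.

```python
def lattice_paths(lattice, node, end):
    """Enumerate every word sequence from `node` to `end` in a word lattice."""
    if node == end:
        return [[]]
    paths = []
    for word, next_node in lattice.get(node, []):
        for rest in lattice_paths(lattice, next_node, end):
            paths.append([word] + rest)
    return paths


# Toy lattice: node -> list of (word, next node). Six arcs in total.
lattice = {
    0: [("i", 1)],
    1: [("have", 2), ("had", 2)],
    2: [("a", 3)],
    3: [("meeting", 4), ("meaning", 4)],
}

n_best = lattice_paths(lattice, 0, 4)
arcs = sum(len(edges) for edges in lattice.values())
words_in_list = sum(len(path) for path in n_best)
print(len(n_best), "hypotheses;", arcs, "lattice arcs vs", words_in_list, "list words")
```

Each additional point of ambiguity multiplies the number of listed hypotheses but adds only a few arcs to the lattice, which is the saving a lattice parser can exploit.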

Footnote 1: Another approach being pursued in parallel in the Janus project is described in [10].

[Figure 2: Janus System Diagram. Legible labels: Speech Synthesizer; Speech in target language]


Lattices, however, are potentially inefficient because of their size. We apply four
steps to make them more tractable ([?]). The first step cleans the lattice by
mapping all non-human noises and pauses into a generic pause; consecutive pauses
are then merged into one long pause, so that the resulting lattice contains only
linguistically meaningful information. Second, the lattice is broken at points where
no human input is recognized for longer than a specified time threshold in the speech
signal, yielding a set of sub-lattices that correspond closely to sentence breaks in the
utterance. Third, each sub-lattice is re-scored using a new language model. Finally,
the lattices are pruned to a size that the parser can process in reasonable time and
space. The re-scoring raises the probability that the correct hypothesis will not be
lost during the pruning stage. Each of the resulting sub-lattices is passed on to the
parser, the first component of the translation process.
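
The cleaning and breaking steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the Janus implementation: it operates on a single linear hypothesis instead of a full lattice, the noise labels and the 600 ms threshold are hypothetical, and the re-scoring and pruning steps are omitted.

```python
# Hypothetical noise labels; the actual recognizer's labels are not given in the text.
NOISE_TOKENS = {"<click>", "<hum>", "<pause>"}
PAUSE = "<PAUSE>"


def clean(tokens):
    """Map noises and pauses to a generic pause and merge consecutive pauses.

    tokens is a list of (word, duration_ms) pairs.
    """
    cleaned = []
    for word, duration in tokens:
        if word in NOISE_TOKENS:
            word = PAUSE
        if word == PAUSE and cleaned and cleaned[-1][0] == PAUSE:
            # Merge with the preceding pause into one long pause.
            cleaned[-1] = (PAUSE, cleaned[-1][1] + duration)
        else:
            cleaned.append((word, duration))
    return cleaned


def split_at_long_pauses(tokens, threshold_ms):
    """Break the token sequence wherever a pause lasts at least threshold_ms."""
    segments, current = [], []
    for word, duration in tokens:
        if word == PAUSE and duration >= threshold_ms:
            if current:
                segments.append(current)
            current = []
        else:
            current.append((word, duration))
    if current:
        segments.append(current)
    return segments


TOKENS = [("bueno", 300), ("<click>", 200), ("<pause>", 500),
          ("qué", 200), ("tal", 200), ("<hum>", 100), ("de", 100), ("dos", 200)]
cleaned = clean(TOKENS)
segments = split_at_long_pauses(cleaned, 600)
print([[w for w, _ in seg] for seg in segments])
```

Short pauses stay inside a segment, while the merged long pause acts as a likely sentence break.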

Parsing  a  word  lattice  involves  finding  all  paths  of  connecting  words  within  the 
lattice  that  are  grammatical.  The  GLR*  ([12],  [13])  parser  skips  parts  of  the  utterance 
that  it  cannot  incorporate  into  a  well-formed  structure.  Thus  it  is  well-suited  to 
domains  in  which  extra-grammaticality  is  common.  The  parser  can  identify  additional 
sentence breaks within each sub-lattice with the help of a statistical method that
determines  the  probability  of  sentence  breaks  at  each  point  in  the  utterance.  The 
output  of  parsing  a  sub-lattice  is  a  set  of  interlingua  texts,  or  ILTs,  representing  all 
of  the  grammatical  paths  through  the  sub-lattice  and  all  of  the  ambiguities  in  each 
grammatical  path.  The  ILTs  from  each  sub-lattice  are  combined,  yielding  a  list  of 
ILT  sequences  that  represent  the  possible  sentences  of  a  full  multi-sentence  turn.  An 
ILT  n-gram  is  applied  to  each  such  list  to  determine  the  probability  of  each  sequence 
of  sentences. 

The  discourse  processor,  based  on  Lambert’s  work  ([14,  15]),  disambiguates  the 
speech  act  of  each  sentence,  normalizes  temporal  expressions,  and  incorporates  the 
sentence  into  a  discourse  plan  tree.  The  discourse  processor’s  focusing  heuristics  and 
plan  operators  eliminate  some  ambiguity  by  filtering  out  hypotheses  that  do  not  fit 
into  the  current  discourse  context.  The  discourse  component  also  updates  a  calendar 
in  the  dynamic  discourse  memory  to  keep  track  of  what  the  speakers  have  said  about 
their  schedules. 

As processing continues, the N-best hypotheses for sequences of ILTs in a multi-sentence
turn are sent to the generator. The generation output for each of the N
hypotheses  is  assigned  a  probability  as  well.  The  generation  output  follows  certain 
forms  and  is  restricted  in  style.  Therefore  a  regular  n-gram  model  can  be  applied  to 
assign  a  probability  to  each  hypothesis. 

The  final  disambiguation  combines  all  knowledge  sources  obtained:  the  acoustic 
score,  the  parse  score,  the  ILT  n-gram  score,  information  from  the  discourse  processor, 
and  a  generation  n-gram  score.  The  best  scoring  hypothesis  is  sent  to  the  speech 
synthesizer.  This  hypothesis  is  also  sent  back  to  the  discourse  processor  so  it  can 
update  its  internal  structures  and  the  discourse  state  accordingly. 

During translation, several knowledge structures are produced which constitute a
discourse context that other processes can refer to. These knowledge structures include
the ILT, the plan tree and focus stack, and the dynamically produced calendar.
The main components of an ILT are the speech act (e.g., suggest, accept, reject),
the sentence type (e.g., state, query-if, fragment), and the main semantic frame
(e.g., free, busy). An example of an ILT is shown in Figure 3. The plan tree is based
on a three-level model of discourse with discourse, domain, and problem-solving levels.
It shows how the sentences relate to each other in discourse segments. The focus stack
indicates which nodes in the plan tree are available for further attachments. Figure 4
shows a plan tree at the discourse level. The first sentence, which is a surface question,
is identified as a Ref-Request (request for information), a Suggest-Form (a possible
way of making a suggestion), and finally part of an Obtain-Agreement-Attempt (a
portion of the discourse in which the two speakers attempt to come to some agreement).
The next sentence attaches as a Self-Initiated-Clarification, indicating
that this sentence makes the suggestion in the previous sentence more clear. The last
two sentences are both Accept-Forms (acceptance of a suggestion), which chain up
together to a Response node, which then attaches to the corresponding suggestion.
The calendar records times that the speakers are considering, suggesting, rejecting,
etc. It is updated dynamically as the conversation progresses. An example of a
calendar is shown in Figure 5. Procedures that resolve ambiguity and select from
among alternative analyses can take advantage of these knowledge structures as well
as simpler ones such as the words in the previous sentence.


“Estás ocupada el lunes”
(Are you busy on Monday)

((FRAME *BUSY)
 (SENTENCE-TYPE *QUERY-IF)
 (A-SPEECH-ACT (*MULTIPLE* *SUGGEST *REQUEST-RESPONSE))
 (SPEECH-ACT *REQUEST-RESPONSE)
 (WHO ((FRAME *YOU)))
 (WHEN ((WH -)
        (FRAME *SIMPLE-TIME)
        (SPECIFIER DEFINITE)
        (DAY-OF-WEEK MONDAY))))

Figure 3: An Interlingua Text (ILT)


3  Techniques  for  Disambiguation 

Resolution of ambiguity is important for accurate translation. Table 1 shows some
examples of translation errors that are caused by failure to resolve ambiguity correctly.
This section describes four disambiguation methods differing along two dimensions:
whether they are knowledge-based or statistical, and whether they are sentence-based
or take discourse context into account. The different types of ambiguities encountered
in Spanish-to-English translation are summarized in Figure 6.

The following subsections describe the disambiguation methods that we tested.
Our sentence-based disambiguation methods are implemented within the GLR* parser
([12], [13]) and its accompanying grammar. One method is knowledge-based, involving
preferences that are explicitly encoded in grammar rules. The other is statistical,
involving probabilities of actions in the LR parsing table. The context-based methods
include knowledge-based discourse plan inference and statistical N-grams of ILTs.


Figure 4: Example Plan Tree


Month: November        Day-of-Week: Wednesday

Speaker 1 Schedule     Speaker 2 Schedule
11:45 neutral          11:45 neutral
12:00 suggested        12:00 accepted
12:15 accepted         12:30 accepted
12:45 suggested        12:45 accepted
13:00 suggested        13:00 accepted
...                    ...
15:00 neutral          15:00 neutral
...                    ...

Figure 5: A Calendar Day Structure


Example 1
  Spanish input:        s1: hola Patricia
                            cómo está
  Actual translation:   hello - Patricia
                        How do you feel about it?
  Correct translation:  How are you?

Example 2
  Spanish input:        s1: en la tarde del miércoles
                        s2: bueno
                            dame un poquito de tiempo para reunirme contigo
                        s1: qué tal de dos a cuatro
                        s2: fabuloso
  Actual translation:   Wednesday afternoon
                        okay
                        give me a little time to meet with you
                        how about from the second till the fourth?
                        that sounds great
  Correct translation:  how about from two o'clock till four o'clock?

Example 3
  Spanish input:        s1: así que si tiene alguna hora en esos días será mejor
  Actual translation:   so if you are free at some time - those days are better
  Correct translation:  so if you are free at some time on those days - that is better

Table 1: Mistranslations of Ambiguous Sentences


Parse  Disambiguation  Using  Grammar  Rule  Preferences 

In order to successfully parse fragmented input, the grammars we use for parsing
spontaneous speech have very inclusive notions as to what may constitute a grammatical
sentence. The grammars allow meaningful clauses and fragments to propagate up to
the top (sentence) level of the grammar, so that fragments may be considered complete
sentences. Additional grammar rules allow an utterance to be analyzed as a
collection of several grammatical fragments. The major negative consequence of this
grammar “looseness” is a significant increase in the degree of ambiguity of the grammar.
In particular, utterances that can be analyzed as a single grammatical sentence
can often also be analyzed in various ways as collections of clauses and fragments.
Our experiments have indicated that, in most such cases, a less fragmented analysis
is more desirable. Thus, we developed a mechanism for preferring less fragmented
analyses.

The  fragmentation  of  an  analysis  is  reflected  via  grammar  preferences  that  are  set 
explicitly  in  various  grammar  rules.  The  preferences  are  recorded  in  a  special  counter 
slot in the constructed feature structure. By assigning counter slot values to the
interlingua structure produced by rules of the grammar, the grammar writer can explicitly
express  the  expected  measure  of  fragmentation  that  is  associated  with  a  particular 
grammar  rule.  For  example,  rules  that  combine  fragments  in  less  structured  ways  can 
be  associated  with  higher  counter  values.  As  a  result,  analyses  that  are  constructed 
using  such  rules  will  have  higher  counter  values  than  those  constructed  with  more 
structurally  “grammatical”  rules,  reflecting  the  fact  that  they  are  more  fragmented. 
Although used primarily to reflect preferences with respect to fragmentation, the
same  mechanism  can  be  used  to  express  other  preferences  as  well. 

We tested the disambiguation performance of the GLR* parser using the grammar
preferences as the sole disambiguation criterion. In this setting, for an ambiguous
sentence that results in multiple analyses, the parser chooses the analysis with the
lowest counter value. Ties between analyses with equal minimal counter
scores are broken at random. This disambiguation method was tested on a set of 512
sentences, 252 of which produce ambiguous parses. As shown in Table 2, the GLR*
parser selected the correct parse in 196 out of the 252 ambiguous sentences. This
corresponds to a success rate of 78%.
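
The selection rule used in this test (lowest counter value, ties broken at random) can be sketched as below. The analysis labels and counter values are hypothetical examples, and the function name is ours; the actual GLR* data structures are not described at this level of detail.

```python
import random


def select_by_counter(analyses, rng=random):
    """Pick the analysis with the lowest fragmentation counter.

    analyses is a list of (label, counter) pairs; ties on the minimal
    counter value are broken at random.
    """
    lowest = min(counter for _, counter in analyses)
    candidates = [a for a in analyses if a[1] == lowest]
    return rng.choice(candidates)


# Hypothetical analyses of one utterance with their counter values.
ANALYSES = [
    ("one full sentence", 0),
    ("clause + fragment", 2),
    ("three fragments", 4),
]
best = select_by_counter(ANALYSES, random.Random(0))
print(best)
```

Passing a seeded `random.Random` instance makes the tie-breaking reproducible, which is convenient when comparing runs.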


Parse  Disambiguation  Using  a  Statistical  Model 

The grammar rule preference mechanism can reflect preferences between particular
grammar rules. However, it does not provide a complete mechanism for disambiguating
between the set of all possible analyses of a given input. This is done by a
statistical module which augments the parser. Our statistical model attaches
probabilities directly to the alternative actions of each state in the parsing table. Because
the state of the GLR* parser partially reflects the left and right context within the
sentence of the parse being constructed, modeling the probabilities at this level has
the potential of capturing preferences that cannot be captured by standard
Probabilistic Context-Free Grammars. For example, a reduce action by a certain grammar
rule A → α that appears in more than one state can be assigned a different probability
in each of the occurrences.

Training of the probabilities is performed on a set of disambiguated parses. The
probabilities of the parse actions induce statistical scores on alternative parse trees,
which are then used for parse disambiguation.

We tested the disambiguation performance of the GLR* parser using a combination
of the statistical parse scores and the grammar rule preference values. The same
test set of 252 ambiguous sentences was evaluated. As can be seen in Table 2, the
combined disambiguation method succeeds in selecting the correct parse in 209 of the
252 cases, a success rate of 83%.
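
A minimal sketch of state-conditioned action probabilities follows. It assumes relative-frequency estimation and a small probability floor for unseen actions; the text does not specify the estimation or smoothing used, and the state numbers and action names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_action_probs(training_parses):
    """Estimate Pr(action | state) by relative frequency (an assumption).

    Each training parse is the list of (state, action) pairs taken while
    building one correct, disambiguated parse.
    """
    counts = defaultdict(Counter)
    for parse in training_parses:
        for state, action in parse:
            counts[state][action] += 1
    return {
        state: {a: c / sum(actions.values()) for a, c in actions.items()}
        for state, actions in counts.items()
    }


def parse_log_score(parse, probs, floor=1e-6):
    """Score a parse as the sum of log action probabilities."""
    return sum(math.log(probs.get(s, {}).get(a, floor)) for s, a in parse)


# Hypothetical training data: the same reduce action occurs in states 3 and 7.
TRAINING = [
    [(3, "reduce A"), (7, "shift")],
    [(3, "shift"), (7, "reduce A")],
    [(3, "reduce A"), (7, "shift")],
]
probs = train_action_probs(TRAINING)
print(probs[3]["reduce A"], probs[7]["reduce A"])
```

Here the same reduce action receives a different probability in each state, which a grammar with one probability per rule could not express.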


Disambiguation  Using  Discourse  Plans 

Our  discourse  processor  is  a  plan  inference  model  based  on  the  recent  work  of  Lambert 
([14,  15]).  The  system  takes  as  its  input  ILTs  of  sentences  as  they  are  uttered  and 
relates  them  to  the  existing  context,  i.e.,  the  plan  tree.  Plan  inferencing  starts  from 
the surface forms of sentences. Then speech-acts are inferred. Multiple speech-acts
for  one  ILT  could  be  inferred.  A  separate  inference  chain  is  created  for  each  possible 
speech  act.  Preferences  for  picking  one  inference  chain  over  another  are  determined 
by  the  focusing  heuristics,  which  provide  ordered  expectations  of  discourse  actions 
given  the  existing  plan  tree.  A  detailed  description  of  the  focusing  heuristics  can  be 
found  in  [16]  and  [17]. 

We  are  currently  conducting  experiments  to  see  how  the  plan  tree  and  focusing 
heuristics  can  help  to  disambiguate  multiple  ILT  outputs  from  the  parser.  We  have 
obtained  some  preliminary  results  concerning  resolving  ambiguities  in  sentence  types 
(statement, query-if, query-ref, fixed-expression, fragment) in the ILT outputs.
Our experiments have shown that the same focusing heuristics, which are useful
for picking the most preferred inference chain for one ILT, can be used for providing
ordered expectations for picking inference chains from multiple ILT outputs of the
parser.


Type of Ambiguity   Number of
                    Occurrences
Slot                20            A piece of information occurs in different slots in each ILT.
Value               162           The ILTs differ in the value of a slot.
Frame               136           The ILTs have different top-level frames.
Sentence breaking   46            The grammar allows more than one way of breaking the input into sentences.
Duplicate           31            The parser produces multiple identical ILTs.
All types           395

Examples

si estás libre el martes ocho puedo reunirme todo el día
  If you are free on Tuesday the eighth I can meet all day. or
  If you are free, on Tuesday the eighth I can meet all day. or
  If you are free on Tuesday, on the eighth I can meet all day.

voy a estar afuera la semana que viene
  I will be out of town the week that's coming up. or
  I will be out of town the week that you're coming.

este día
  this day or [illegible]

nos podemos reunir a las dos
  We can meet at two. or Can we meet at two?

nos reunimos el veintitrés
  We will meet on the twenty third. or
  We met on the twenty third.

dos a cuatro
  second at four or second to fourth or two to four

vamos a ver
  Let's see. or We will check. or We will see.

bueno
  Good or Well...

qué tal
  How are you? or How is that?

el dos es bueno
  The second is good. or It is the second. Good.

no está bien
  It is not good. or No, it is good.

qué bueno
  How great! or What? Good.

voy a -- a las dos probablemente
  I will leave on the second probably.

el martes es el dos de octubre
  Tuesday is the second of October.

Figure 6: Types of Ambiguities

The experiment consists of two steps. First, we try to attach
each ILT from the set of ambiguous ILTs of a sentence to the existing dialog model.
Second, the results of attachment for each ILT are compared. The best attachment
is considered to be the one which best continues the existing context. When multiple
attachments are possible, the focusing heuristics are used to make comparisons. For
example, the sentence Y nos podríamos reunir a la una can be a statement (And we
could meet at one) or a yes-no question (And could we meet at one?). The focusing
heuristic prefers the statement because it attaches to the current focus action, whereas
the question attaches to an ancestor of the current focus action. The performance
of the plan tree and focusing strategy on sentence type ambiguities is shown
in Table 3.

From Table 3, it can be seen that by using context and the focusing heuristics,
the discourse processor achieves a general performance of 86% for sentence type
disambiguation, which is an improvement over the 80% performance of the statistical
parser without using context. For the statement vs query-if ambiguity, the
discourse processor has a performance of 85%.

Statistical  Methods  for  Using  Context  for  Disambiguation 

As  we  described  above,  the  statistical  scores  assigned  by  the  parser  are  based  on 
sentence  structure  without  taking  the  context  of  surrounding  sentences  into  account. 
In  this  section  we  describe  a  statistical  approach  that  uses  context  to  help  parse 
disambiguation.  This  work  involved  assigning  probabilities  to  full  utterances.  We 
consider  a  full  utterance,  U,  as  a  sequence  of  sentences  represented  by  ILTs.  Such  an 
utterance  could  be  assigned  an  approximated  bigram  probability  by  the  formula: 

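$$\Pr(U) = \Pr(\mathrm{ILT}_1, \mathrm{ILT}_2, \ldots, \mathrm{ILT}_n) = \prod_{i=1}^{n} \Pr(\mathrm{ILT}_i \mid \mathrm{ILT}_{i-1}) \qquad (1)$$

If $\mathrm{ILT}_i$ is the first ILT of an utterance, then $\mathrm{ILT}_{i-1}$ is the last ILT in the previous
utterance of the other speaker.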
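
As a rough illustration of formula (1), the sketch below multiplies bigram probabilities over the ILTs of one utterance, starting from the last ILT of the other speaker's previous utterance. ILTs are reduced to opaque labels and the probability table is invented for the example; both are simplifying assumptions.

```python
def utterance_probability(ilts, previous_ilt, bigram):
    """Approximate Pr(U) as the product of Pr(ILT_i | ILT_{i-1}).

    previous_ilt is the last ILT of the other speaker's previous utterance,
    which conditions the first ILT of this utterance.
    """
    prob = 1.0
    prev = previous_ilt
    for ilt in ilts:
        prob *= bigram[(ilt, prev)]
        prev = ilt
    return prob


# Hypothetical bigram table keyed by (current, previous).
BIGRAM = {
    ("query-if", "statement"): 0.5,
    ("statement", "query-if"): 0.25,
}
p = utterance_probability(["query-if", "statement"], "statement", BIGRAM)
print(p)
```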

Because we cannot compute bigrams of full ILTs, our preliminary work has involved
computing the probabilities of the sentence-type, speech-act and top-level
frame of an ILT using the bigram probabilities described below. Standard smoothing
techniques are used to calculate the conditional probabilities. Because we take into
account the speakers of the current and previous sentences, a slot from the previous
ILT is considered differently depending on whether it was uttered by the same speaker or
not. The amount of training data was not sufficient to calculate more complex N-grams
such as $\Pr(\text{frame}_n \mid \text{frame}_{n-1}\ \text{sentence-type}_{n-1}\ \text{speech-act}_{n-1})$ or
$\Pr(\text{frame}_n \mid \text{frame}_{n-1}\ \text{frame}_{n-2})$. We thus compute only the following probabilities:

$P_1 = \Pr(\text{sentence-type}_n \mid \text{sentence-type}_{n-1})$

$P_2 = \Pr(\text{sentence-type}_n \mid \text{speech-act}_{n-1})$

$P_3 = \Pr(\text{sentence-type}_n \mid \text{frame}_{n-1})$

$P_4 = \Pr(\text{frame}_n \mid \text{sentence-type}_{n-1})$

$P_5 = \Pr(\text{frame}_n \mid \text{speech-act}_{n-1})$

$P_6 = \Pr(\text{frame}_n \mid \text{frame}_{n-1})$

The above probabilities, together with the parser’s score $P_0$, are interpolated to
assign the ILT’s conditional probability $\Pr(\mathrm{ILT}_n \mid \mathrm{ILT}_{n-1}) = \sum_{k=0}^{6} \lambda_k P_k$, where the
weights $\lambda_k$ sum to one and are assigned so as to maximize the performance of the model.


                Random   Grammar       Statistical Parse   ILT      Number of
                         Preferences   Disambiguation      N-gram   Sentences
Cross-talk      41%      81%           84%                 88%      91
Push-to-talk    39%      76%           81%                 83%      161
Total           40%      78%           83%                 85%      252

Table 2: Disambiguation of All Ambiguous Sentences
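
The interpolation of the parser's score with the six context probabilities can be sketched as a weighted sum. This assumes plain linear interpolation, which is one natural reading of "interpolated" with weights that sum to one; the weight and score values below are invented, and the tuning of the weights is not shown.

```python
def interpolate(scores, weights):
    """Linearly interpolate component scores with weights that sum to one."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to one")
    return sum(w * p for w, p in zip(weights, scores))


# Hypothetical weights: index 0 is the parser's score, 1..6 the context bigrams.
WEIGHTS = [0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]

# Two hypothetical candidate ILTs for the same sentence.
candidate_a = [0.5, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]
candidate_b = [0.4, 0.6, 0.6, 0.6, 0.2, 0.2, 0.2]

score_a = interpolate(candidate_a, WEIGHTS)
score_b = interpolate(candidate_b, WEIGHTS)
print(score_a, score_b)
```

In this invented case the candidate with the lower parser score wins once the context probabilities are taken into account, which is the effect the context-based method is meant to capture.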


4  Comparison  of  Disambiguation  Methods 

Each  of  the  disambiguation  methods  described  above  was  trained  or  developed  on  a  set 
of  thirty  Spanish  scheduling  dialogs  and  tested  on  a  set  of  fifteen  previously  unseen 
dialogs.  The  development  set  and  test  set  both  contain  a  mixture  of  dialogs  that 
were  recorded  in  two  different  modes.  In  push-to-talk  dialogs,  participants  cannot 
interrupt  each  other.  The  speaker  must  hit  a  key  to  indicate  that  he  or  she  is  finished 
speaking  before  the  other  participant  can  speak.  In  cross-talk  dialogs,  the  participants 
can interrupt each other and speak simultaneously. Each speaker is recorded on a
separate  track.  Push-to-talk  sentences  tend  to  be  longer  and  more  complex. 

Table  2  shows  the  performance  of  three  disambiguation  methods  in  comparison  to 
a  baseline  method  of  selecting  a  parse  randomly.  The  three  disambiguation  methods 
are  cumulative  in  the  sense  that  each  one  builds  on  the  previous  one.  The  first 
method,  Grammar  Preferences,  involves  the  explicit  coding  of  preferences  in  grammar 
rules. The second method, Statistical Parse Disambiguation, refers to the parse score
computed by the GLR* parser, which takes into account the probabilities of actions
in the GLR* parsing table as well as the grammar preferences. The third method,
ILT n-grams, disambiguates top-level frames, sentence-types, and speech-acts, but
relies  on  the  parse  score  to  resolve  other  ambiguities.  As  can  be  seen  in  Table  2  and 
Figure  7,  each  method  adds  a  slight  improvement  over  the  others  that  it  incorporates. 

Table 3 shows the performance of four disambiguation methods in resolving sentence-type
ambiguities. The first row shows performance on the most common ambiguity in
Spanish: the ambiguity between statements and yes-no questions (query-if). Without
access to intonation, statements are often indistinguishable from yes-no questions
because they have the same word order in some circumstances. The four methods
compared are the Grammar Preferences, Statistical Parse Disambiguation, and ILT
N-grams described above, as well as Discourse Plan Inference. The Discourse Plan
Inference is not cumulative with the other disambiguation methods. The input to the
plan inference system is all of the ambiguous ILTs from the parser, without statistical
parse scores. In this table, performance is calculated for the correct disambiguation
of sentence-type only. Other ambiguities in the same sentences are not counted. The
context-based methods, ILT N-grams and Discourse Plan Inference, perform better
than the sentence-based methods in resolving the ambiguity between statements and
yes-no questions. The second row of the table shows performance on all sentence-type
ambiguities. Here also, the context-based methods do better than the sentence-based
methods.


Figure 7: Disambiguation of All Ambiguous Sentences


                          Random   Grammar       Statistical Parse   Discourse Plan   ILT       Number of
                                   Preferences   Disambiguation      Inference        N-grams   Sentences
Statement vs. query-if    57%      82%           80%                 85%              94%       114
All sentence types        51%      82%           80%                 86%              90%       166

Table 3: Disambiguation of Sentence Types


5 Conclusion


The approach we have taken is to allow multiple hypotheses and their corresponding
ambiguities to cascade through the translation components, accumulating information
that is relevant to disambiguation along the way. In contrast to other approaches that
use predictions to filter out ambiguities early on, we delay ambiguity resolution as
much as possible until the stage at which all knowledge sources can be exploited.
A consequence of this approach is that much of our research effort is devoted to
the development of an integrated set of disambiguation methods that make use of
statistical and symbolic knowledge.

In this paper we examined four disambiguation methods, two that are sentence-based
and two that use discourse context. In our experiments, the context-based
methods performed somewhat better than the sentence-based methods. However,
we believe that the best approach will be an integration of these and possibly other
methods. Our future work will focus in particular on how to combine the knowledge
provided by the discourse processor with that provided by the parser and ILT N-grams.
We believe that this is a promising path to follow because different sets of
sentences are correctly disambiguated by each of the methods. Another feature of
our future work will be to evaluate the effect of improved disambiguation on overall
end-to-end translation quality.


ABSTRACT

In our effort to build spoken language translation systems we have
extended our JANUS system to process spontaneous human-human
dialogs in a new domain, two people trying to schedule a meeting.
Trained on an initial database, JANUS-2 is able to translate English
and German spoken input into either English, German, Spanish,
Japanese or Korean output. To tackle the difficulty of spontaneous
human-human dialogs we improved the JANUS-2 recognizer along
its three knowledge sources: acoustic models, dictionary and language
models. We developed a robust translation system which performs
semantic rather than syntactic analysis and thus is particularly suited
to processing spontaneous speech. We describe repair methods to
recover from recognition errors.

1.  Introduction 

JANUS [1, 2] has been among the first systems attempting
to provide spoken language translation. While the previous
JANUS-1 system processed syntactically well-formed read
speech over a 500 word vocabulary, JANUS-2 operates on
spontaneous human-human dialogs in a scheduling domain
with vocabularies exceeding 2000 words. Currently, English
and German spoken input can be translated into either English,
German, Spanish, Japanese or Korean output. Work is in
progress to add Spanish and Korean as input languages.

This paper reports on the current status of the system and ongoing
efforts to extend and improve the recognition component.
Then, we describe our new approach to robust translation of
spoken language. We briefly describe and compare the alternative
approach to parsing and translation we pursue, based
on a generalized robust LR parser and an ILT. Finally, we report
on efforts to detect erroneous system output and provide
interactive methods to recover from such errors.

2.  Current  Status  of  JANUS 
2.1.  Data  Collection 

Data collection to establish a large database of spontaneous
human-human negotiation dialogs in English and German
started about 18 months ago. In the meantime, several sites
in Europe, the US and Asia have adopted the Scheduling task
under several research projects and funding sources. Since the
same calendars and data collection protocols are used, the data
elicited shares the same domain and procedural constraints.


                      dialogs       words
English Scheduling
  recorded            1984          505 K
  transcribed         1826          460 K
German Scheduling
  recorded            734           158 K
  transcribed         534           115 K
Spanish Scheduling
  recorded            340           79 K
  transcribed         256           70 K
ATIS3
  transcribed         [illegible]   250 K

Table 1: Comparison of Databases (as of December 1994)

Table 1 summarizes the current status of data collection.
Since Scheduling utterances typically consist of more than
one sentence, there is already more data available for English
Scheduling than ATIS¹. More data collection will establish
databases in size at least comparable to ATIS for all languages.

In Spanish, we have explored two different data collection
scenarios. To allow only one person to speak at a time, the
push-to-talk scenario requires the speaker to push a button
while talking to the system. The cross-talk scenario allows
speakers to speak simultaneously without a push button. The
speech of each dialog partner is recorded on a separate channel.

2.2.  System  Overview 

The main system modules are speech recognition, parsing,
discourse processing, and generation. Each module is
language-independent in the sense that it consists of a general
processor that applies independently specified knowledge about
different languages.

¹The roughly 18000 utterances in English Scheduling correspond to some
30000 sentences.

The recognition module decodes the speech in the source language
into a list of sentence candidates, represented either as
a word lattice or an N-best list. At the core of the machine translation
components is a language-independent representation
of the meaning, which is extracted from the recognizer output
by the parsing module. As the last step, the final language-independent
representation is sent to the generator to be translated
into any of the target languages. Figure 1 shows the system
architecture.

After parsing, a discourse processor can be used to put the
current utterance in the context of previous utterances, opening
possibilities to integrate the speech and natural language
processing components of the system to resolve parsing ambiguities
and dynamically adapt the vocabulary and language
model of the recognizer based on the current discourse state.




Figure  1:  System  Diagram 

We explore several approaches for the main processes. For
example, we are experimenting with TDNN, MS-TDNN [3],
MLP, LVQ [4], and HMMs [5, 12] for acoustic modeling;
n-grams, word clustering, and automatic phrase detection for
language modeling [6]; statistically trained skipping parsing
[7, 8], neural net parsing [9] and concept spotting parsing
[10] for extracting the meaning; and statistical models
as well as plan inferencing for identification of the discourse
state [11]. This multi-strategy approach should lead to improved
performance with appropriate weighting of the output
from each strategy.

2.3. Recognition Performance Analysis

The baseline JANUS-2 recognizer can be described as
follows:

•  Preprocessing: LDA on mel-scale Fourier spectrum and
additional acoustic features (power, silence)

•  Acoustic modeling: LVQ-2 or phonetically tied
SCHMM, no cross-word triphones, explicit noise
models

•  Decoder: Viterbi search as first pass, followed by a
word-dependent Nbest search, standard word bigram language
model, word lattice output

Current recognition results on the English, German and
Spanish Spontaneous Scheduling Task (ESST, GSST, SSST) are
shown in Table 2.


               ESST   GSST   SSST
Word Accuracy  66%    72%    61%

Table 2: JANUS-2 baseline recognition performance
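
The paper does not spell out how word accuracy is computed. A
minimal sketch, assuming the common definition (one minus the
word-level edit distance divided by the reference length), is
given below; the function names are illustrative and not part of
JANUS-2.

```python
def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions
    needed to turn the reference word list into the hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # match / substitution
        prev = cur
    return prev[-1]


def word_accuracy(ref, hyp):
    """Assumed definition: 1 - (S + D + I) / N for N reference words."""
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```

Under this definition insertions count against the score, so
word accuracy can become negative for very poor hypotheses.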

The low absolute recognition accuracies are due to the
challenging nature of human-human spontaneous speech. In the
official evaluation of the German VERBMOBIL project on
the GSST task, the JANUS-2 decoder outperformed all other
participating systems. In addition, recent evaluations on
the Switchboard task confirm that human-human dialogs are
much more difficult to recognize than human-machine
spontaneous speech (like ATIS). Participating systems achieved
word accuracies between 30% and 50%.

Analysis shows that human-human dialogs (like Scheduling
or Switchboard) are more difficult to recognize than
human-machine dialogs (e.g. ATIS). Perplexities lie between 35
and 90 for ESST, SSST and GSST, and somewhat over 100 for
Switchboard. Additionally, human-human dialogs are
significantly more disfluent [8]. Large variations in speaking
rates and strong coarticulation between words contribute
significantly to the difficulty of recognizing human-human
spontaneous speech.
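
To make the perplexity figures concrete, the following is a
minimal sketch of how the perplexity of a word bigram model can
be measured on held-out sentences. The add-alpha smoothing and
the function name are assumptions made for illustration; the
paper does not state which smoothing scheme was used.

```python
import math
from collections import Counter


def bigram_perplexity(train, test, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram model.

    train, test: lists of sentences, each a list of words.
    Words unseen in training get only the smoothing mass, so the
    model is not renormalized for out-of-vocabulary words.
    """
    history, bigrams = Counter(), Counter()
    vocab = {"</s>"}
    for sent in train:
        words = ["<s>"] + sent + ["</s>"]
        vocab.update(sent)
        for prev, cur in zip(words, words[1:]):
            history[prev] += 1
            bigrams[(prev, cur)] += 1
    v = len(vocab)
    log_prob, n = 0.0, 0
    for sent in test:
        words = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(words, words[1:]):
            p = (bigrams[(prev, cur)] + alpha) / (history[prev] + alpha * v)
            log_prob += math.log2(p)
            n += 1
    return 2 ** (-log_prob / n)
```

A perplexity of 88 means that, on average, the model is as
uncertain about the next word as if it had to choose uniformly
among 88 words.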

3.  Improving  the  Recognition  Component 
We describe efforts to improve the recognition component
along its major knowledge sources: acoustic models [12],
dictionary [13] and language models [14].

3.1. Data-Driven Codebook Adaptation

We developed methods aimed at automatic optimization of the
number of parameters for the semi-continuous phonetically
tied HMM used in JANUS-2. Usually, a fixed number of
codebook vectors is assigned to each of the phonemes. However,
as the available training data differs between phonemes and
the size of the feature space that phonemes cover varies
greatly, a constant codebook size leads to suboptimal
allocation of resources.

We have therefore proposed [12] adapting the codebook size
of each phoneme according to the amount and the distribution
of the training data, similar to [15]. During training, the size
of the codebook is incrementally increased. A quality
criterion determines when to stop increasing
the codebook. We compared a variance criterion, based on
the average distance between data points and their nearest
codebook vector, with a prediction criterion, which tries to
capture how well the recognizer's models can predict
unseen data.
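
The variance criterion can be sketched as follows. This is an
illustration under stated assumptions, not the procedure of [12]:
the codebook starts from the data mean, grows by one vector at a
time (seeded at the worst-represented data point and refined by
k-means), and stops once the average squared distance to the
nearest codebook vector falls below a hypothetical threshold.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(data, codebook, iters=10):
    """Refine codebook vectors with a few k-means iterations."""
    for _ in range(iters):
        clusters = [[] for _ in codebook]
        for p in data:
            j = min(range(len(codebook)), key=lambda k: sq_dist(p, codebook[k]))
            clusters[j].append(p)
        # Keep the old vector if a cluster received no data.
        codebook = [mean(c) if c else codebook[j] for j, c in enumerate(clusters)]
    return codebook


def avg_distortion(data, codebook):
    """Average squared distance between points and their nearest vector."""
    return sum(min(sq_dist(p, c) for c in codebook) for p in data) / len(data)


def grow_codebook(data, threshold, max_size=64):
    """Grow the codebook until the variance criterion is met."""
    codebook = [mean(data)]
    while avg_distortion(data, codebook) > threshold and len(codebook) < max_size:
        worst = max(data, key=lambda p: min(sq_dist(p, c) for c in codebook))
        codebook = kmeans(data, codebook + [worst])
    return codebook
```

With this stopping rule, a phoneme whose training data is
compact receives few vectors, while one that spreads widely over
the feature space receives more.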


Model       Codebook Size  Word Accuracy
baseline    4600           66.9%
variance    4201           69.9%
prediction  1677           67.8%

Table 3: Results for Codebook Adaptation (GSST)

Table 3 compares recognition accuracies and codebook sizes
of the baseline models with models automatically adapted
using the variance and prediction criteria. As can be seen,
codebook adaptation leads to significant error reduction if the
same number of parameters is used. The number of parameters
can be reduced by 40% with still better performance than the
baseline system.
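
The size of the gain is easier to judge as a relative reduction
in word error. The small helper below (an illustrative name, not
from the paper) converts the word accuracies of Table 3 into
that figure.

```python
def relative_error_reduction(base_acc, new_acc):
    """Relative reduction in word error, given accuracies in percent."""
    base_err = 100.0 - base_acc
    new_err = 100.0 - new_acc
    return (base_err - new_err) / base_err


# Word accuracies taken from Table 3 (GSST).
variance_gain = relative_error_reduction(66.9, 69.9)     # about 0.09
prediction_gain = relative_error_reduction(66.9, 67.8)   # about 0.03
```

In other words, the variance criterion removes roughly 9% of the
baseline word errors and the prediction criterion roughly 3%.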

3.2. Dictionary Learning

Due to the enormous variability in spontaneous human-human
dialogs, creating adequate dictionaries with alternative
pronunciations is crucial [16]. However, hand tuning and
modifying dictionaries is time consuming and labor intensive.
Pronunciations of a word should be chosen according to their
frequency. Modifications of the dictionary should not lead
to higher phonetic confusability after retraining. Therefore
we have proposed [13] a data-driven approach to improve
existing dictionaries and automatically add new words and
pronunciation variants whenever needed.

The learning algorithm requires transcripts for the whole
training set and a phoneme confusability matrix of the speech
recognizer used. First, phonetic transcriptions for all
occurrences of each word are generated with the help of a
phoneme recognizer. Then, variants which are infrequent or which
would lead to erroneous training of confusable phonemes are
eliminated. Finally, the acoustic models are retrained allowing
for the newly acquired pronunciation variants.
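
The pruning step can be sketched as below. The relative
frequency threshold and the confusability test are assumptions
made for illustration: here a variant is dropped if it differs
from the most frequent variant only by substitutions between
phoneme pairs marked as confusable, on the reasoning that such a
variant is more likely a phoneme recognizer error than a real
pronunciation. The actual criteria of [13] may differ.

```python
from collections import Counter, defaultdict


def differs_only_by_confusable(a, b, confusable):
    """True if b differs from a only by confusable substitutions."""
    if len(a) != len(b) or a == b:
        return False
    return all(x == y or frozenset((x, y)) in confusable for x, y in zip(a, b))


def learn_variants(observations, min_rel_freq=0.15, confusable=frozenset()):
    """Select pronunciation variants per word.

    observations: iterable of (word, phoneme sequence) pairs, one
    per word occurrence, as produced by a phoneme recognizer.
    """
    counts = defaultdict(Counter)
    for word, phones in observations:
        counts[word][tuple(phones)] += 1
    dictionary = {}
    for word, variants in counts.items():
        total = sum(variants.values())
        best = variants.most_common(1)[0][0]
        kept = []
        for phones, n in variants.most_common():
            if n / total < min_rel_freq:
                continue  # too infrequent
            if differs_only_by_confusable(best, phones, confusable):
                continue  # probably a recognizer confusion
            kept.append(phones)
        dictionary[word] = kept
    return dictionary
```

The retraining step that follows in the paper is not shown; it
would use the resulting dictionary to realign the training data.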

As can be seen in Table 4, our algorithm for adapting and
adding phonetic transcriptions to a dictionary improves the
recognition accuracy of the decoder significantly and leads to
performance that is comparable to the context-dependent
results (cf. Table 2). The baseline decoder for these
experiments uses 69 context-independent phoneme models.
Evaluation using context-dependent models is in progress.


Dictionary  Word Accuracy
baseline    61.7%
learned     65.6%

Table 4: Results for Dictionary Learning (GSST)

3.3. Morpheme Based Language Models

Based on our scheduling databases we noticed that in
morphologically rich languages such as German and Spanish,
dictionaries grow much faster with increasing database size
than in English (cf. Figure 2). This is due to the large
number of inflections and compound words. One way to limit
this growth is to use base units other than words.


We compared three different decomposition methods:

•  strictly morpheme based decomposition, e.g. weggehen
(to go away) → weg-geh-en, Spracherkennung
(speech recognition) → Sprach-er-kenn-ung

•  decomposition in root forms, e.g. weggehen (to go away)
→ weggeh@, Spracherkennung (speech recognition)
→ Spracherkenn@

•  combination of strictly morpheme based decomposition
and root forms
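
The effect of the first two methods on vocabulary size can be
illustrated with a toy example. The segmentation table below is
hand-written and hypothetical (a real system would need a
morphological analyzer), and the root form is taken here to be
the word without its final morpheme, marked with "@" as in the
examples above.

```python
# Hypothetical hand-written segmentations for a toy word list.
SEGMENTS = {
    "weggehen": ["weg", "geh", "en"],
    "weggehe": ["weg", "geh", "e"],
    "gehen": ["geh", "en"],
    "gehe": ["geh", "e"],
    "gehst": ["geh", "st"],
    "wegkommen": ["weg", "komm", "en"],
    "wegkommst": ["weg", "komm", "st"],
    "kommen": ["komm", "en"],
    "komme": ["komm", "e"],
    "kommst": ["komm", "st"],
    "Spracherkennung": ["Sprach", "er", "kenn", "ung"],
    "erkennen": ["er", "kenn", "en"],
}


def morphemes(word):
    """Strictly morpheme based decomposition."""
    return SEGMENTS[word]


def root_form(word):
    """Root form: drop the final morpheme and mark it with '@'."""
    return "".join(SEGMENTS[word][:-1]) + "@"


def inventory_sizes(words):
    """Number of distinct words, morphemes and root forms."""
    words = set(words)
    morph_units = {m for w in words for m in morphemes(w)}
    root_units = {root_form(w) for w in words}
    return len(words), len(morph_units), len(root_units)
```

On this toy list, 12 word forms reduce to 10 morphemes or 6 root
forms; the gap widens as more inflected forms of the same stems
are added, which is the growth effect the section describes.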

Table 5 shows dictionary size, bigram perplexity and
recognition accuracy using the respective decomposition method,
based on 250 GSST dialogs. As can be seen, all decomposition
methods significantly reduce vocabulary size and perplexity.
The impact on recognition accuracy is still small.
This may be because acoustic modeling suffers from the
smaller units, which offsets the gain in the language
model. In a real interface, however, this reduction in
vocabulary growth leads to a reduction of new words. Further
research will focus on automatically finding more efficient and
acoustically less confusable decompositions, and on testing
the impact on translation.


Method      Dictionary  Perplexity  Accuracy
Baseline    3821        88          64.7%
Morphemes   2391        46          65.4%
Root Forms  3205        79          63.3%
Combined    2998        59          65.1%

Table 5: Comparison of Decomposition Methods (GSST)

4.  Concept Based Speech Translation

We have developed a robust translation system based on the
information structures inherent in the appointment scheduling
task being performed, described in detail elsewhere [10]. The
basic premise is that the structure of the information conveyed
is largely independent of the language used to encode it. Our
system tries to model the information structures in a task
and the way these structures are realized in words in various
languages. This system is an extension of the Phoenix Spoken
Language System [18]. It uses the Phoenix parser to parse
input into slots in semantic frames, and then uses these frames
to generate output in the target language.

4.1. The Parser

Unlike  individual  words,  semantic  units  used  in  a  task  domain 
are  not  language  specific.  Based  on  transcripts  of  scheduling 
dialogs,  we  have  developed  a  set  of  fundamental  semantic 
units  in  our  parse  which  represent  the  different  concepts  a 
speaker  would  use.  For  instance,  a  typical  temporal  token 
could  have  date  as  subtoken,  which  could  in  turn  consist  of 
month  and  day  subtokens.  The  temporal  could  be  part  of  a 
statement  of  unavailability. 

In contrast to previous speech translation systems, we
presently don't perform syntactic analysis. Speaker
utterances, as decoded by the recognizer, are parsed into
semantic chunks which are concatenated without grammatical rules.
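
The idea of spotting semantic chunks and skipping everything
else can be sketched with a few patterns. This is only an
illustration: the pattern table is a hypothetical toy grammar,
and the Phoenix parser uses semantic frames with nested tokens
rather than flat regular expressions.

```python
import re

# Hypothetical toy grammar: (concept label, pattern) pairs.
PATTERNS = [
    ("unavailable", re.compile(r"\b(?:not sure about|can't make|is bad)\b")),
    ("available", re.compile(r"\b(?:i could see you|works for me|is good)\b")),
    ("day", re.compile(
        r"\b(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday)\b")),
    ("time", re.compile(r"\bafter \w+ (?:am|pm)\b")),
]


def spot_concepts(utterance):
    """Return (label, chunk) pairs in order of appearance.

    Words that match no pattern are simply skipped, so false
    starts and filler words do not make the parse fail.
    """
    text = utterance.lower()
    found = []
    for label, pattern in PATTERNS:
        for m in pattern.finditer(text):
            found.append((m.start(), label, m.group(0)))
    return [(label, chunk) for _, label, chunk in sorted(found)]
```

Applied to the utterance in Figure 3, the sketch keeps the two
days, the two availability statements and the time expression,
and ignores the remaining words.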


Original utterance:

THAT SATURDAY I'M NOT SURE ABOUT BUT YOU SAID
YOU MAY BE BACK IF YOU THINK YOU'LL BE BACK
THE THIS SUNDAY THE TWENTY EIGHTH I COULD SEE
YOU AFTER ELEVEN AM ON THAT IF YOU'RE BACK

Translated:

Saturday not so good for me Sunday the twenty eighth works for me
after eleven a.m. (ENGLISH)

El sábado no me va demasiado bien pero el domingo veintiocho me va bien
después de las once de la mañana. (SPANISH)

Samstag könnte ich nur zur Not aber Sonntag der Achtundzwanzigste geht
bei mir ganz gut nach elf Uhr morgens. (GERMAN)

Figure 3: Translation Example

This approach is particularly well suited to parsing
spontaneous speech, which is often ungrammatical and subject to
recognition errors. It is more robust than requiring
well-formed input and relying on syntactic cues provided
by short function words such as articles and prepositions.

4.2. The Generator

The generation component of the system is a simple
left-to-right processing of the parsed text. The translation
grammar consists of a set of target-language phrasings for each
token, including lookup tables for variables like numbers and
days of the week. When a lowest-level token is reached in tracing
through the parse, a target-language representation is created
by replacing tokens with templates for the parent token,
according to the translation grammar. The result is a meaningful,
although terse, translation, which emphasizes communicating
the main point of an utterance. An example is illustrated in
Figure 3.
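
A minimal sketch of such template-based generation follows. The
frame format, the template strings and the lookup tables are
assumptions for illustration (the phrasings are loosely modeled
on Figure 3); they are not the system's actual translation
grammar.

```python
# Hypothetical lookup tables for variables (days of the week).
DAYS = {
    "en": {"saturday": "Saturday", "sunday": "Sunday"},
    "es": {"saturday": "el sábado", "sunday": "el domingo"},
    "de": {"saturday": "Samstag", "sunday": "Sonntag"},
}

# Hypothetical target-language phrasings for each top-level token.
TEMPLATES = {
    "en": {"unavailable": "{day} not so good for me",
           "available": "{day} works for me"},
    "es": {"unavailable": "{day} no me va demasiado bien",
           "available": "{day} me va bien"},
    "de": {"unavailable": "{day} könnte ich nur zur Not",
           "available": "{day} geht bei mir ganz gut"},
}


def generate(frames, lang):
    """Process parsed frames left to right, filling one template each."""
    parts = []
    for token, slots in frames:
        day = DAYS[lang][slots["day"]]
        parts.append(TEMPLATES[lang][token].format(day=day))
    return " ".join(parts)


# Language-independent parse of the utterance in Figure 3 (simplified).
FRAMES = [("unavailable", {"day": "saturday"}),
          ("available", {"day": "sunday"})]
```

Because the frames carry no source-language words, the same
parse can be rendered in any language for which templates and
lookup tables exist.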

4.3. Results

We have implemented this system for bi-directional
translation between English, German and Spanish in our scheduling
task. Figure 4 shows the performance of the parser and
subsequent generator on transcribed data. Evaluation of the
system based on speech decoded by the JANUS-2 recognizer is still
underway.


           Parsed from           Translated into
           token     utterance   utterance
English    95.6%     90.0%       90.2%
German     92.4%     89.6%       87.3%
Spanish    88.8%     58.3%       82.2%

Figure 4: End-to-End evaluation on transcribed data

One disadvantage of this approach is the telegraphic and
repetitive nature of the translations. This could be overcome by
providing multiple translation options for individual tokens in
the target-language module, different levels of politeness, etc.
However, at present we feel that it is sufficient for intelligible
communication.

5.  GLR*  Parser 

In addition to the concept based Phoenix parser we pursue
GLR* as a robust extension of the Generalized LR Parser. It
attempts to find maximal subsets of the input that are parsable,
skipping over unrecognizable parts of the input sentence [7].
By means of a semantic grammar, GLR* parses input
sentences into an interlingua text (ILT) as a language-independent
representation of the meaning of the input sentence, described
in more detail elsewhere (e.g. [8]).

Compared to Phoenix parses, the ILT generated by GLR*
offers a greater level of detail and more specificity, e.g.
different speaker attitudes and levels of politeness. Thus,
translation based on ILT is more natural, overcoming the
telegraphic and terse nature of concept based translation.

A drawback of GLR* was that it expected input segmented
into sentences for efficiency reasons. However, typical
Scheduling utterances consist of 2-3 sentences. To integrate
the parser with the speech decoder, we developed methods
which extend the parsing capabilities from single sentences to
multi-sentence utterances. We extended the grammar with a
high-level rule that allows the input utterance to be analyzed
as a concatenation of several sentences and developed two
methods to constrain the number of sentence breaks that are
considered by the parser. The first is a heuristic which prunes
out all parses that are not minimal in the number of sentences.
The second is a statistical method to disregard potential
sentence breaking points that are statistically unlikely.
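
Both constraints can be sketched in a few lines. The data
layout (a parse as a list of sentence-level segments) and the
break probability function are assumptions for illustration; the
paper does not describe how the statistics are estimated.

```python
def prune_to_minimal(parses):
    """Keep only parses with the fewest sentences.

    parses: list of parses, each a list of sentence segments.
    """
    fewest = min(len(p) for p in parses)
    return [p for p in parses if len(p) == fewest]


def allowed_breaks(words, break_prob, threshold=0.1):
    """Positions where a sentence break may be hypothesized.

    Position i lies between words[i - 1] and words[i]. break_prob
    is a hypothetical function returning the estimated probability
    of a break between two adjacent words.
    """
    return [i for i in range(1, len(words))
            if break_prob(words[i - 1], words[i]) >= threshold]
```

The second function limits the break points the parser has to
consider in the first place, while the first discards
over-segmented analyses after parsing.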

For the English analysis grammar, time efficiency thus
improved by about 30%. As an additional benefit, the parse
quality improved because strange sentence breaks are rejected
in favor of more reasonable locations.

6.  Handling  Unreliability 

Although research has boosted performance of speech
recognition and spoken language translation technology,
recognition and translation errors will persist. To build a
system for use in real applications we need repair methods to
recover from errors in a graceful and unobtrusive way. We
have developed a speech interface for repairing recognition
errors by simply respeaking or spelling a misrecognized
section of an utterance. While much speech "repair" work has
focused on repairs within a single spoken utterance [19], we
are concerned with the interactive repair of errorful recognizer
hypotheses [20].


6.1.  Identifying  Errors 

To be able to repair an error, its location has to be determined
first. We pursue two strategies to identify misrecognitions as
subpieces of the initial recognizer hypothesis.

The automatic subpiece location technique requires the user to
respeak only the errorful subsection of the (primary) utterance.
This (secondary) utterance is decoded using a vocabulary and
language model limited to substrings of the initial erroneous
hypothesis. Thus, the decoding identifies the respoken section
in the hypothesis. Preliminary testing showed that the method
works poorly if the subpiece to be located is only one or two
words long. However, this drawback is not severe since
humans tend to respeak a few words around the error.
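
At the word level, the restriction to substrings can be sketched
as follows. This is an illustrative stand-in, not the JANUS
implementation: the "language model" is reduced to the set of
word pairs that are adjacent in the hypothesis, and decoding is
assumed to return one of the permitted word sequences.

```python
def substring_language_model(hypothesis):
    """Vocabulary and allowed successors derived from the hypothesis.

    If a word occurs more than once in the hypothesis, this bigram
    constraint also admits some sequences that are not substrings.
    """
    vocab = set(hypothesis)
    successors = {}
    for prev, cur in zip(hypothesis, hypothesis[1:]):
        successors.setdefault(prev, set()).add(cur)
    return vocab, successors


def accepts(vocab, successors, words):
    """True if the word sequence is permitted by the restricted model."""
    return (all(w in vocab for w in words)
            and all(b in successors.get(a, ()) for a, b in zip(words, words[1:])))


def locate(hypothesis, decoded):
    """Return (start, end) of the decoded section in the hypothesis."""
    n = len(decoded)
    for i in range(len(hypothesis) - n + 1):
        if hypothesis[i:i + n] == decoded:
            return i, i + n
    return None
```

The located span is the section that is then highlighted and
passed to one of the repair methods.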

A second technique uses confidence measures to determine for
each word in the recognizer hypothesis whether it was
misrecognized. First, we applied a technique similar to Ward [21],
which turns the score for each word obtained during decoding
into a confidence measure by normalizing the score and using
a Bayesian updating technique based on histograms of the
normalized score for correct and misrecognized words. Since
we found this not to work well on our English scheduling
task, we are currently developing different methods to
compute confidence measures based on decoder, language model
and parser scores.
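
The histogram-based step can be sketched as an application of
Bayes' rule. The bin edges, the add-one smoothing and the prior
are assumptions for illustration; the details of [21] and of our
adaptation are not given in the text.

```python
import bisect


def build_histogram(scores, edges):
    """Smoothed probability of each score bin.

    edges: sorted bin boundaries; scores fall into len(edges) + 1 bins.
    """
    counts = [0] * (len(edges) + 1)
    for s in scores:
        counts[bisect.bisect_right(edges, s)] += 1
    total = len(scores)
    # Add-one smoothing so that empty bins keep a non-zero probability.
    return [(c + 1) / (total + len(counts)) for c in counts]


def confidence(score, edges, hist_correct, hist_wrong, prior_correct):
    """Posterior probability that a word with this score is correct."""
    b = bisect.bisect_right(edges, score)
    p_correct = hist_correct[b] * prior_correct
    p_wrong = hist_wrong[b] * (1.0 - prior_correct)
    return p_correct / (p_correct + p_wrong)
```

The two histograms would be estimated from normalized word
scores on held-out data where the correct transcription is
known.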

6.2. Robust Speech Repair

After locating and highlighting erroneous sections in the
recognizer hypothesis, misrecognitions are corrected.

The spoken hypothesis correction method uses Nbest lists for
both the initial utterance and the respoken section. The Nbest
list for the highlighted section of the initial utterance is
rescored using scores from decoding the secondary utterance.
Depending on the quality of the Nbest lists, most
misrecognitions can be corrected.
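
A minimal sketch of such rescoring is shown below, assuming that
each Nbest list is available as a mapping from candidate word
sequences to log scores and that the two scores are combined
linearly. The weight and the penalty for candidates missing from
the secondary list are hypothetical choices.

```python
def rescore(primary_nbest, secondary_nbest, weight=0.5):
    """Rank candidates for the highlighted section.

    primary_nbest, secondary_nbest: dicts mapping a tuple of words
    to a log score (higher is better). Candidates absent from the
    secondary list get a score just below its worst entry.
    """
    floor = min(secondary_nbest.values(), default=0.0) - 1.0
    combined = {
        cand: (1.0 - weight) * score + weight * secondary_nbest.get(cand, floor)
        for cand, score in primary_nbest.items()
    }
    return sorted(combined, key=combined.get, reverse=True)
```

A candidate that was ranked second in the primary utterance can
thus move to the top if the respoken section supports it
strongly.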

The spelling hypothesis correction method requires the user to
spell the highlighted erroneous section. A spelling recognizer
decodes the spelled sequence of letters. By means of a
language model we restrict the sequence of letters to alternatives
found among the Nbest list from the located section.

To date, we have evaluated our methods over sentences from
the Resource Management task. Table 6 shows the
improvements in sentence accuracy, based on recordings from one
speaker of the February and October 1989 test data. We
selected a subset of erroneous utterances; therefore the
accuracy of the baseline system is significantly lower than the
94% performance our system achieves on the whole test set.
The results indicate that repeating or spelling a misrecognized
subsection of an utterance can be an effective way to repair
recognition errors.

Table 6: Improvement of Sentence Accuracy by Repair


7.  Conclusions 

We have made significant advances towards building a
multilingual translation system for spontaneous human-human
dialogs. Beyond speech recognition of spontaneous speech,
JANUS provides a framework to investigate important areas
like robust parsing, machine translation of spoken language
and developing methods to recover from recognition and
parsing errors. To achieve acceptance in real applications, we
have to embed the spoken language technology in a sensible and
useful user interface that is carefully designed around human
factors and common needs. To be flexible and robust, such
interfaces should not only recognize speech but also
recognize other communication modalities, provide freedom from
headset and push-buttons, allow for graceful recovery from
errors and miscommunications, know what they don't know,
and model what the user does or doesn't know [23].

Abstract

Machine-Aided Voice Translation (MAVT) is a
development begun in 1990 for a spoken language
translation prototype whose primary use is to assist
Air Force interrogation personnel in interacting with
speakers of foreign languages. A significant potential
use of the MAVT prototype is to provide similar
support for law enforcement personnel, who have shown
considerable interest in the development. The paper
describes the second phase of MAVT development,
which will result in a speaker-independent,
continuous speech, multilingual translation prototype for
English => Spanish|Arabic|Russian => English.


1  Introduction 

Machine-Aided Voice Translation (MAVT) is a
development begun in 1990 under contract to Rome
Laboratory, AFMC, for a spoken language translation
prototype to assist Air Force personnel in interacting
with speakers of foreign languages. The initial phase
of the project, which concluded in 1992, resulted in
the development of a speaker-independent, continuous
speech translation system for English => Spanish =>
English, using a vocabulary of about 500 words. An
overview of the system as well as a summary of
evaluation results are given in [1].

This paper describes the Phase II MAVT ADM
system (Figure 1), which provides voice input and
output for English => Spanish|Arabic|Russian =>
English, with a planned vocabulary of approximately
1,000 words per language. Like the Phase I system,
the current system is composed of three subsystems:
a speech recognition system, a natural language
processing system, and speech generators.
Speaker-independent, continuous speech recognition is
accomplished via Entropic's HMM Toolkit, while speech
synthesis for English and Spanish utilizes Entropic's
TrueTalk™, licensed from AT&T. (Generators for
Arabic and Russian are still under negotiation at this
time.) As in the Phase I system, natural language
understanding and translation generation is achieved
via LSI's DBG natural language processing system,
which has been extended to incorporate a
language-independent translation component that integrates
predicate representations based on Jackendoff's
Lexical Conceptual Structures (henceforth LCS) [2], [3]
with DBG's frames and lexicon [4]. These three
subsystems are briefly described in the following sections.

* The work reported in this paper is supported by AFMC,
Rome Laboratory/IRAA, Griffiss Air Force Base, NY, under
Contract No. F30602-93-C-0098. Earlier work was supported
under Contract No. F30602-90-C-0058.

2  The  DBG  Natural  Language 
Processing  System 

LSI's DBG system has served as the NLP engine for
a variety of text understanding applications, focusing
on information extraction for data base generation
(from which the acronym DBG is derived) for a
range of different types of text, and message fusion,
based on a large sample of transcribed radiotelephone
traffic. The components of the DBG system as
configured for these applications include modules for lexical
lookup and morphological analysis, full syntactic and
semantic analysis, and discourse or text-level
analysis. The analyzed content of a text is represented as
a set of interconnected frame structures called
templates, which reflect the entities and events described
in a source text.

Figure 1. Version 1.5 MAVT ADM System Diagram

For the MAVT application, modules were added
to generate the target language text. In the Phase I
MAVT development, a direct transfer strategy was
used to achieve translation, although many of the
components were designed for multilingual use. In
the current MAVT development, we have adopted
an interlingual approach to translation. Much of
the extension of the DBG system for the MAVT
project has necessarily focused on the multilingual
capabilities of the system. In the first phase of the
project, the DBG system already had in place a
multilingual syntactic parser that was used for
Spanish and English. An updated version of this parser
will be used to parse Arabic and Russian as well.
DBG also produces, as output of the understanding
phase of processing, a knowledge representation
of the sentence. This knowledge representation is
an application-independent data structure of related
event and entity frames based on the predicates and
arguments of the sentence, as well as on an underlying
frame-based concept hierarchy. These frames, called
templates in the DBG system, represent the
knowledge contained in a sentence. On the basis of this
structure, which is the end product of analysis of the
source language (hereafter SL) sentence, the target
language (TL) lexical items are selected, and
generation processing is applied to construct a translation
of the sentence.

The DBG knowledge representation thus functions
as an intermediate or interlingual (henceforth, IL)
construct. An IL approach does not rely on
direct transfer or direct links between languages but
requires a language-independent representation of the
data, which can then be used to translate the sentence
into any language that the system can handle. The IL
approach thus eliminates the need to develop a
separate, direct interface between every potential
source-target language pair because each language need only
interface with the language-independent IL
representation.
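
The saving can be made concrete by counting components. The
sketch below assumes one transfer module per ordered language
pair for direct transfer, and one analyzer plus one generator
per language for the IL approach; the function names are
illustrative.

```python
def direct_transfer_modules(n_languages):
    """One module per ordered source-target language pair."""
    return n_languages * (n_languages - 1)


def interlingual_modules(n_languages):
    """One analyzer (SL to IL) and one generator (IL to TL) per language."""
    return 2 * n_languages


# English, Spanish, Arabic and Russian, as in the Phase II system.
direct = direct_transfer_modules(4)      # 12
interlingual = interlingual_modules(4)   # 8
```

The gap grows quadratically: each further language adds two
components under the IL approach, but a transfer module to and
from every existing language under direct transfer.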

From the commencement of the MAVT project,
including the Phase I development, LSI's approach has
been interlingual in that it assumes that the selection
of lexical items in the TL should be based on links
to an intermediate structure, rather than on direct
or hard links between words in the source and
target languages. In Phase I, this was realized insofar
as words corresponding to the same basic meaning in
each language were linked to common concept nodes
in the frame-based knowledge hierarchy. These links
are present in each event and entity template in the
knowledge representation.

For some lexical categories, e.g., nouns, this works
well. But where cross-category relations and
compositional semantics are important, as in verb phrases,
which express predicate-argument relations, the
lexical properties are much more complex. In a
multilingual system, incorporating lexical-semantic
information for the words associated with a given concept for
all of the different languages into the concept
hierarchy would greatly increase the complexity of the
hierarchy. A limitation of using links to the concept
hierarchy as the only intermediary, then, is that the
concept hierarchy primarily represents meaning relations
between concepts of the same category rather than
representing the unique properties of the meanings of
the individual words associated with those concepts,
or the meaning relations and structural requirements
of the words in sentences. A great deal of
syntactic and semantic checking still remains to be done to
determine whether a potential TL word is
compatible with the meaning and structural requirements of
the TL sentence. Thus, in our Phase II development
(the ADM phase), we determined it was highly
desirable to construct an IL representation which could
rely on some other knowledge source, beyond just the
frame-based knowledge hierarchy. The emergent
theory of Lexical-Conceptual Structures was determined
to be highly appropriate as a means of encoding the
additional knowledge representation required. These
structures, when combined with DBG's existing
interlingual characteristics, have proven to be exactly
the link needed to create what we deemed was an
appropriately robust IL representation.

The DBG system has a modular design, wherein
text is analyzed in progressive stages. The output
of each stage of processing is a data structure that
then serves as input to the following processing stage.
As illustrated in Figure 2, there are four stages of SL
analysis of a sentence that precede the IL template
representation; the IL representation is then followed
by four stages of TL generation. The four stages of
SL analysis are: a) lexical identification, b)
morphological analysis, c) syntactic parsing, and d) semantic
parsing. The four stages of TL generation mirror in
part the SL analysis; they are w) lexical selection, x)
semantic parsing, y) syntactic parsing, and z)
morphological inflection (see Figure 2; the acronym RLCS
stands for "Root Lexical-Conceptual Structure", that
is, the form of the LCS which is stored with the lexical
root in the lexicon).

Stages a), b) and z) are mirror images of one another
in that in a) and b) inflected lexical items are analyzed to
determine their lexical stems and morphological
features, and in z) lexical stems are inflected based on
the accompanying morphological features. Likewise,
c) and y) are very similar in that in both the internal
syntactic structure associated with the sentence is
organized in a principle-based manner, using a
binary-branching version of X-bar theory. The difference
between c) and y) is that in c) the structure of the SL
sentence is discovered based on lexical and
morphological information derived from an actual sentence,
whereas in y) the syntactic structure is being built
based on a semantic outline of the proposed TL
sentence.

At the heart of processing in the DBG
translation system are the three intermediate stages: the
SL semantic parse (d, above), the IL templates, and
the TL semantic parse (x, above). These are where
translation occurs and it is into these data structures
that we have incorporated Jackendoff's LCS (as
mentioned earlier). An LCS is a labeled bracketing,
similar to a syntactic parse structure, but one wherein
the constituent labels, predicates and arguments are
semantically-based primitives, rather than syntactic
and language-specific lexical items. The data
structures at these three stages are essentially of the same
type: sets of attribute-value pairs related to other
pairs by means of indexing. This kind of structure
allows the system to pass on actual sentence chunks,
along with associated features of whatever type, e.g.,
morphological, semantic, pragmatic, in a
homogeneous format. An actual example of the three
intermediate stages is provided in Figure 3. A detailed
discussion of this innovative development is presented
in our paper for the AMTA 94 conference [4].

3  ASR via HTK: an HMM Software
Toolkit

The speech recognition component of MAVT-ADM is an HMM
toolkit. Entropic Research Laboratory licenses this
technology from the Cambridge University Technology Transfer
Company, and is responsible for ongoing support of HTK and
future enhancements. HTK allows flexible development and
modification of speaker models (e.g., recognizers for
different languages and applications) based on Hidden Markov
Model (HMM) principles, for isolated, connected, or
continuous speech recognition. The recognizer is
syntax-driven, via a finite state grammar which is customized
for a particular recognition task. In recent ARPA testing of
speech recognition systems developed by ARPA contractors and
others, the HTK-based system performed comparably with those
of ARPA contractors on dictation tasks involving a 5,000 word
vocabulary and a 20,000 word vocabulary derived from Wall
Street Journal texts. On the 5,000 word task, the recognizer
developed with HTK performed at 95% accuracy; on the complex
20,000 word dictation task, it performed at 87%. HTK is
written in ANSI C, and runs on Sun, H-P, DEC, or SGI
workstations under Unix.
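
To illustrate what "syntax-driven, via a finite state grammar" means in practice, here is a minimal sketch of a finite-state grammar that limits which word may follow at each point. It is not HTK's grammar format or API; the `GRAMMAR` table, the state names, and the vocabulary are assumptions made up for this example.

```python
# Hypothetical finite-state grammar: state -> {word: next_state}.
GRAMMAR = {
    "START": {"request": "VERB", "roger": "END"},
    "VERB": {"permission": "NOUN"},
    "NOUN": {"to": "INF"},
    "INF": {"defend": "END", "advance": "END"},
}


def allowed_next(state):
    """Words the grammar permits from `state` (the recognizer's search space)."""
    return sorted(GRAMMAR.get(state, {}))


def accepts(words):
    """True if the word sequence follows a complete path through the grammar."""
    state = "START"
    for word in words:
        transitions = GRAMMAR.get(state, {})
        if word not in transitions:
            return False
        state = transitions[word]
    return state == "END"


print(allowed_next("INF"))
print(accepts("request permission to defend".split()))
```

Customizing the recognizer for a new task then amounts to swapping in a different transition table rather than retraining the acoustic models.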

In the initial demonstration version of the MAVT-ADM,
speaker-independent, continuous speech recognizers for a
limited mission-oriented vocabulary have been developed for
English, Latin American Spanish, Arabic, and Russian.

4  TrueTalk™ Text-to-Speech (TTS) Software

TrueTalk™ is an advanced software-only TTS system that
converts digitized text into speech, with a word
intelligibility rate of approximately 97%. Entropic licenses
this technology from AT&T, where it has been in development
over the past 10 years. TrueTalk™ features a variety of user
controls, including pitch, word duration, intonation, and
speaking rate. For English, TrueTalk™ uses a primary
dictionary of 166,000 words, and a secondary dictionary to
assist in accurate pronunciation of proper names, such as
location designations. The Spanish vocabulary is of a
comparable size. TrueTalk™ runs on Sun, H-P, or SGI
workstations under Unix.

Automatic English-to-Korean Text Translation
of  Naval  Operational  Reports 

Young-Suk  Lee,  Dinesh  Tummala, 

Stephanie  Seneff,  Cliff  Weinstein,  and  Jack  Lynch 

The automatic English-to-Korean text translation project in our group is based on the
natural language understanding system TINA (S. Seneff, 1992) and the generation system
GENESIS (J. Glass, J. Polifroni, and S. Seneff, 1994), which were developed under ARPA
sponsorship by the Spoken Language Systems Group at the MIT Laboratory for Computer
Science. The overall goal of the project is to produce machine translation of both text and
speech for enhanced multilingual and multinational operations. This project has its origins in
the CCLINC translation system (Tummala et al., 1993). CCLINC is an automatic
speech-to-speech translation system for limited-domain multilingual applications including
English, French, and Korean.

The  MUC-II  data,  our  source  language  data,  consists  of  105  naval  messages,  which 
feature  incidents  involving  different  platforms  such  as  aircraft,  surface  ships,  submarines,  and 
land  targets.  The  data  contain  linguistically  challenging  features  such  as  numerous  instances 
of  coordination,  complex  sentences,  multiple  modifiers,  and  compound  nouns.  At  the  same 
time,  the  data  have  typical  characteristics  of  free  texts  including  ellipsis  and  misspelling.  We 
have  translated  206  sentences  (out  of  643  sentences),  and  built  up  an  English/Korean  bilingual 
lexicon  containing  432  vocabulary  items,  which  is  easily  reusable  by  other  systems  (including 
PC-based  ones). 

The system demonstrated runs on a SPARC 10 workstation. The Korean translation
outputs are displayed in a 'hangul' window running on UNIX, and the Korean inputs are
typed in 'hangul' emacs, a version of emacs customized to support the Korean alphabet.

Abstract

This paper describes CCLINC, a system architecture and
concept demonstration for automatic speech-to-speech
translation for limited-domain multilingual applications. The
primary target application is the coalition battle management
environment. CCLINC utilizes a Common Coalition Language
(CCL) as a military interlingua. CCLINC is a
speaker-independent system which translates spoken utterances
in English into French or Korean. The current system has a
vocabulary of around 700 words. The system architecture
for CCLINC consists of a modular, multilingual structure
including speech recognition, language understanding,
language generation, and speech synthesis in each language. A
key new feature of the system is the tight coupling of the
speech recognition and language understanding modules. We
summarize the architectures of the component systems and
the interfaces between them, and present our preliminary
performance results.

1.  Introduction 

This paper describes a system architecture and concept
demonstration for automatic speech-to-speech translation for
limited-domain multilingual applications. (Other
speech-to-speech translation systems are described in
[9, 10, 13].) The primary target application is enhanced
communication among military forces in a multilingual
coalition environment, where the translation utilizes a Common
Coalition Language as a military interlingua. This interlingua
is designed to allow representation of the meanings of the
limited-domain communications among forces in a common format
for transmission.

The system architecture (see Figure 1) for CCLINC consists
of a modular, multilingual structure including speech
recognition, language understanding, language generation,
and speech synthesis in each language. The meaning
representation is in the form of a semantic frame, which is
transmitted over the Common Coalition Language network. The
system design provides for verification of the system's
understanding of each utterance to the originator, in a
paraphrase in the originator's language, before transmission
on the coalition network. Successful system operation depends
on the ability to define a sufficiently constrained, but useful,

^This work was sponsored by the Advanced Research Projects
Agency. The views expressed are those of the authors and do not
reflect the official policy or position of the U.S. Government.

^Spoken Language Systems Group, Laboratory for Computer
Science, Massachusetts Institute of Technology, Cambridge, MA
02139.

^Now with Dragon Systems Inc., 320 Nevada St., Newton, MA
02160.


vocabulary  and  grammar,  so  that  a  high  percentage  of  input 
sentences  can  be  successfully  understood.  This  understand¬ 
ing  would  also  provide  the  opportunity  to  carry  out  update 
and  query  of  command  and  control  databases  via  CCL,  along 
with  the  translation  for  human  communication. 
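
The control flow described above (understand, paraphrase back to the originator for verification, then transmit the semantic frame and generate in each target language) can be sketched as follows. This is only an outline under stated assumptions: the function names, the frame contents, and the toy `understand`/`generate` stand-ins are hypothetical and do not reproduce CCLINC's components.

```python
def translate_with_verification(utterance, understand, generate, confirm,
                                source_lang, target_langs):
    """Translate only after the originator confirms the system's paraphrase."""
    frame = understand(utterance)
    if frame is None:
        return None  # outside the constrained vocabulary and grammar
    paraphrase = generate(frame, source_lang)
    if not confirm(paraphrase):
        return None  # originator rejected the system's understanding
    # The frame is what would travel over the coalition network;
    # each receiving side generates text in its own language.
    return {lang: generate(frame, lang) for lang in target_langs}


# Toy stand-ins, invented for this example.
def toy_understand(text):
    if text == "request permission":
        return {"act": "request", "object": "permission"}
    return None


def toy_generate(frame, lang):
    return f"[{lang}] {frame['act']} {frame['object']}"


result = translate_with_verification(
    "request permission", toy_understand, toy_generate,
    confirm=lambda paraphrase: True,
    source_lang="en", target_langs=["fr", "ko"])
print(result)
```

Keeping verification ahead of transmission means a misunderstood utterance never reaches the network, at the cost of one extra confirmation step per message.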

The  rest  of  the  paper  is  organized  as  follows.  First,  we 
describe  CCLINC,  paying  particular  attention  to  the  speech 
recognition  and  natural  language  components  as  well  as  the 
interface  between  these  components.  Then  we  describe  the 
training  and  present  and  evaluate  the  results  of  our  prelimi¬ 
nary  experiments.  This  is  followed  by  a  discussion  of  lessons 
learned.  Finally,  we  give  our  future  plans. 

2.  System  Description 

2.1  Overview 

The preliminary implementation of the CCLINC system uses
a version of the Lincoln stack-decoder-based HMM system
for continuous speech recognition [7, 8], in conjunction with
language understanding (TINA) [1, 11, 15] and language
generation (GENESIS) [2] systems which have been ported from
the Spoken Language Systems Group at the MIT Laboratory
for Computer Science. The vocabulary, grammar, and
semantics are based on a coalition brigade task and are
defined based on consultation with Army personnel and others
familiar with brigade communications, a specification of
command and control message formats, and a limited set of
transcribed brigade exercise communications. For instance,
the system has knowledge of basic Army radio-telephone
vocabulary (e.g., roger, break, etc.), Army radio-telephone
protocols (e.g., user identification), and basic military terms
(e.g., weapons as well as terms such as TOC [tactical
operation center] and FLOT [forward line of troops]). The
current working vocabulary is 692 words^ and the domain
includes 253 semantic categories in the brigade
communications domain.

CCLINC currently handles many sentences of moderate
linguistic complexity. In particular, CCLINC understands
both the active and passive voice and numerous verb forms
(e.g., present tense, past participle, present participle, and
imperative). The current system deals with three languages:
English, Korean, and French. It accepts English speech/text
input only, and translates via CCL to Korean (Hangul) or
French text. We are using a commercial text-to-speech system
on the English paraphrases which are produced based
on the semantic understanding. We have recently obtained
but not yet integrated a Korean text-to-speech synthesizer.


^Although all versions of CCLINC recognize 692 words, some
versions do not have any meaningful training data for 171 of these
words. We will have more to say about this in section 3.1.

Figure 1: System structure for multilingual speech-to-speech translation.


We do not have, as yet, a French speech synthesizer. Figure 2
shows an overview of CCLINC.

2.2  Speech  Recognition 

The preliminary CCLINC system uses Lincoln's
large-vocabulary stack-decoder-based HMM in conjunction with
a set of speaker-independent, trigram acoustic models [4, 8]
and an augmented Carnegie Mellon Pronouncing Dictionary
for speech recognition.


2.3  SR/NL  Integration 

The integration of continuous speech recognition (CSR) and
natural language (NL) models has been an important part
of this effort. We have implemented a new, tightly-coupled
approach in which the TINA language model is integrated
directly into the stack-based search [3]. For comparison, we
have also implemented the type of decoupled approach in
more general use in the ARPA community, where the 1-best
or N-best CSR pipes its output into the language
understanding module.^ Thus, the recognizer runs in two
different modes: a decoupled mode and a tightly-coupled mode,
hereafter referred to as TINA-LM. In the decoupled mode, the
recognizer is supported by a statistical language model; we
have run experiments with a data-driven bigram backoff
language model, a data-driven trigram backoff language model,
and a TINA-generated bigram backoff language model. The
TINA-generated bigram is created by expanding TINA's
rules exhaustively to the terminals, multiplying out
conditional probabilities along the way. In the tightly-coupled
mode, TINA provides the sole linguistic support for the
recognizer, proposing probabilities for each next word that is
allowed by the grammar.
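
The tightly-coupled mode can be pictured as a search in which the grammar alone decides which next words are considered at all. The sketch below is a simplification made for illustration, not the Lincoln stack decoder or TINA itself: the `PGRAMMAR` table, its probabilities, and the acoustic scores are invented.

```python
import math

# Hypothetical probabilistic grammar: state -> {word: (probability, next_state)}.
PGRAMMAR = {
    "S": {"roger": (0.6, "SUBJ"), "request": (0.4, "OBJ")},
    "SUBJ": {"i": (1.0, "VERB")},
    "VERB": {"got": (1.0, "PRON")},
    "PRON": {"it": (0.9, "END"), "that": (0.1, "END")},
}


def propose_next(state):
    """Next-word probabilities the grammar allows from `state`."""
    return {word: prob for word, (prob, _next) in PGRAMMAR.get(state, {}).items()}


def extend(hypothesis, acoustic_log_likelihoods):
    """Extend one partial hypothesis using only grammar-approved words.

    hypothesis: (words, grammar_state, log_score)
    acoustic_log_likelihoods: {word: log-likelihood of the next speech segment}
    """
    words, state, score = hypothesis
    extensions = []
    for word, (prob, next_state) in PGRAMMAR.get(state, {}).items():
        if word in acoustic_log_likelihoods:
            new_score = score + acoustic_log_likelihoods[word] + math.log(prob)
            extensions.append((words + [word], next_state, new_score))
    return sorted(extensions, key=lambda h: h[2], reverse=True)


# "a" matches the audio better than "it" here, but the grammar never proposes it.
partial = (["roger", "i", "got"], "PRON", 0.0)
best = extend(partial, {"a": -1.0, "it": -1.5})
print(best[0][0])
```

In a decoupled system the acoustically stronger but ungrammatical word would survive recognition and only fail later, in the understanding module.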


^At the time, we only ran a 1-best CSR.


2.4  Machine (Text) Translation

The current CCLINC system uses TINA and GENESIS as
its NL component (i.e., to perform machine or text
translation). Machine translation systems vary along two major
dimensions: basic approach (i.e., operation by statistical vs.
symbolic/linguistic means) and depth of analysis (i.e., direct
replacement, transfer, or interlingual) [7]. TINA/GENESIS
is classified as a symbolic/linguistic, interlingual machine
translation system within this framework.^ TINA is based
on a context-free grammar augmented with syntactic and
semantic features [1, 11, 15]. The parser, with the aid of a
morphological analyzer, produces a parse tree representation
of the input sentence. This parse tree is then mapped to a
semantic frame, which is the starting point for the language
generation module, GENESIS.

GENESIS produces a paraphrase in the target language
from the semantic frame [2]. The semantic frame is
intended to capture the meaning of an utterance in a way
that preserves the hierarchical dependencies in the
utterance. Language generation is effected by the interaction
of the language-independent GENESIS engine with three
language-specific modules. These modules are a lexicon, a
set of message templates, and a set of rewrite rules. The
role of the lexicon is to specify the surface form of a
semantic frame entry, including the construction of inflectional
endings. The catalog of message templates determines the
ordering of constituents in a sentence. The third module,
the rewrite rules, captures phonotactic constraints and
contractions. For instance, in French, “de les” is realized as
“des.”
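
The division of labor among lexicon, message templates, and rewrite rules can be sketched with a toy generator. This is an illustration under assumptions, not GENESIS code: the table layouts, the `possession` template, and the example frame are invented; only the “de les” to “des” contraction comes from the text.

```python
import re

# Hypothetical language-specific modules for French.
LEXICON = {"fr": {"permission": "la permission", "of": "de",
                  "the_troops": "les troupes"}}
TEMPLATES = {"fr": {"possession": ["topic", "of", "owner"]}}
REWRITES = {"fr": [(r"\bde les\b", "des")]}


def generate(frame, lang):
    """Language-independent engine: template order, lexical lookup, then rewrites."""
    order = TEMPLATES[lang][frame["type"]]
    words = []
    for slot in order:
        # A slot names either a frame entry or a fixed function word.
        key = frame[slot] if slot in frame else slot
        words.append(LEXICON[lang][key])
    text = " ".join(words)
    for pattern, replacement in REWRITES[lang]:
        text = re.sub(pattern, replacement, text)
    return text


frame = {"type": "possession", "topic": "permission", "owner": "the_troops"}
print(generate(frame, "fr"))
```

Adding a language in this scheme means supplying three new tables while the `generate` function stays unchanged, which is the portability argument made for interlingual systems.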

Figure 3 and Figure 4 show the parse tree, semantic frame,
and paraphrases produced by CCLINC for the sample
sentence, “Request permission to defend hilltop echo.” One


^Although TINA's rules are entered manually, TINA includes
a probabilistic framework, along with an automatic training
capability.

Figure 2: Process flow of CCLINC.


point to note in Figure 3 is the presence of syntactic
categories near the root of the tree (i.e., statement, predicate,
infinitive, etc.) and semantic categories near the leaves of
the tree (i.e., fortify, the_location, etc.). Also note that the
sentence which is translated in Figure 3 and in Figure 4 is a
statement. A sentence in the coalition brigade domain is
either a statement, command, callup (i.e., a sentence in which
a user identifies himself), or reply (i.e., a subjectless phrase
which may include, among other things, an opening remark
such as “roger,” a command and control message such as
“sitrep,” and/or a closing remark such as “over”).

An English paraphrase of the sample sentence as well as
translations in French and Korean appear in Figure 4. Note
that the English paraphrase differs from the input sentence
in two ways. First, we have inserted the subject “we.” The
input sentence does not contain an explicit subject. The
implicit subject is “I” or “we.” We arbitrarily chose the plural
“we” rather than the singular “I” as the subject. The second
way in which the input sentence differs from its English
paraphrase is in its choice of infinitive. The input sentence
uses the word “defend” whereas the English paraphrase uses
the word “fortify.” The reason for this difference is that
CCLINC generalizes the verb “defend.” In fact, the verbs
“defend,” “fortify,” and “strengthen” are all mapped to the
same semantic category: the fortify category. The idea is to
reduce the number of semantic objects known to the system
(i.e., the number of lexical entries, the number of message
templates, etc.) without losing meaning.
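
The verb generalization just described amounts to a many-to-one mapping from surface words to semantic categories. A minimal sketch, assuming a plain dictionary as the mapping (the `SEMANTIC_CATEGORY` table and `generalize` function are hypothetical names, not CCLINC internals):

```python
# Three surface verbs share one semantic category, as stated in the text.
SEMANTIC_CATEGORY = {
    "defend": "fortify",
    "fortify": "fortify",
    "strengthen": "fortify",
}


def generalize(words):
    """Replace each word by its semantic category when one is defined."""
    return [SEMANTIC_CATEGORY.get(word, word) for word in words]


sentence = "request permission to defend hilltop echo".split()
print(" ".join(generalize(sentence)))
```

Only one lexical entry and one set of message templates per target language is then needed for the whole category, which is why the English paraphrase reads “fortify” even though the speaker said “defend.”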


2.5  Text-to-Speech  Synthesis 

We have recently obtained, but not yet integrated, the
Korean text-to-speech synthesizer “Says.” “Says” is a product
of Digicom. We do not have, as yet, a French speech
synthesizer. On our English paraphrases, we are using a
synthesizer developed by Eloquent Technology, Inc.


3.  Training  and  Evaluation 

3.1  Training 

We are currently using the transcription of a Task Force
Command Net exercise as the main source of training and
test data. The data contain 1400 transcribed utterances
which we have divided into two training sets of
approximately 500 sentences each and two test sets of
approximately 200 sentences each. For the experiments reported
here, we make use of only one of the training sets and only
one of the test sets. In addition, we had generated 33
sentences within the domain as an initial data set, giving us a
total of 530 training sentences.

The bigram and trigram language models were trained
from these 530 sentences using standard techniques. TINA's
rules were developed by hand, based on observed patterns in
these sentences. TINA's probabilities were trained
automatically by parsing each training sentence and updating
appropriate counts. It should be noted that TINA can only parse
and understand 321 of the 530 training sentences (60.6%).^
The only knowledge TINA has of the other 209 sentences is
of the existence of the individual words in these sentences.
There are 171 words which appear in those 209 sentences
that do not appear in the rest of the training data. Hence,
the TINA language model and, by inference, the TINA-LM
system and the TINA-generated bigram have no meaningful
training data for 171 of CCLINC's 692 words.
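
The coverage figures in this paragraph follow from simple set arithmetic, sketched below. The `coverage_stats` helper, the toy sentences, and the stand-in parser are assumptions for illustration; only the counts 321, 530, and 209 come from the text.

```python
def coverage_stats(sentences, parses):
    """Return (parse rate, words that occur only in unparsed sentences)."""
    parsed = [s for s in sentences if parses(s)]
    unparsed = [s for s in sentences if not parses(s)]
    parsed_vocab = {w for s in parsed for w in s.split()}
    unparsed_only = {w for s in unparsed for w in s.split()} - parsed_vocab
    return len(parsed) / len(sentences), unparsed_only


# Toy data; the "parser" here is a stand-in that accepts sentences starting with "roger".
toy = ["roger over", "request fire support", "roger out"]
rate, unseen = coverage_stats(toy, lambda s: s.startswith("roger"))
print(round(rate, 3), sorted(unseen))

# Check of the figures reported in the text.
reported_rate = round(100 * 321 / 530, 1)
reported_unparsed = 530 - 321
print(reported_rate, reported_unparsed)
```

Words in the `unparsed_only` set are exactly the ones for which a grammar-based language model has no usage statistics, which is the situation described for 171 of the 692 vocabulary words.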

3.2  Evaluation 

We have run very preliminary experiments to obtain initial
benchmarks on the performance of the system and its
components. In particular, we will report separate results on
speech recognition, text understanding, and speech
understanding. In all cases, we will be using as the test data
one of the unseen sets mentioned above, a set of 190 sentences.
For speech recognition, we report for three separate
experimental conditions (i.e., distinct language models):
data-driven bigram,

^We have not yet implemented a robust parsing capability,
which would greatly extend TINA's coverage.

Figure 3: Parse tree for a sample sentence.


Input: Request permission to defend hilltop echo

Semantic Frame (Common Coalition Language):

{c statement
   :mode "fpi"
   :number "epi"
   :pred {p v_request
      :topic {q permission
         :complement {p fortify
            :aux "to"
            :topic {q hilltop
               :pred {p initials
                  :topic "echo" }}}}}}

English Paraphrase: We request permission to fortify hilltop echo

French Paraphrase: Nous la permission de fortifier le sommet echo

Kortiaa  Paraphraae:  i^Ch 


Figure 4: The semantic frame and paraphrases for a sample sentence.

data-driven trigram, and TINA-LM. The performance is
evaluated based on insertion, deletion, and substitution
error rates as well as word and sentence error rates.
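
For readers unfamiliar with these metrics, the sketch below counts substitutions, deletions, and insertions from a minimum-edit alignment of a reference transcript and a recognizer output, then derives word and sentence error rates. It is a generic textbook computation, not the scoring tool used in these experiments; the function names are assumptions, and ties between equally cheap alignments are broken arbitrarily.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-edit alignment."""
    # cost[i][j] = (total_errors, subs, dels, ins) for ref[:i] versus hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)          # delete every reference word
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)          # insert every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)    # match, no error
            else:
                diagonal = (t + 1, s + 1, d, n)
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            cost[i][j] = min(diagonal, deletion, insertion)
    return cost[-1][-1][1:]


def word_error_rate(ref, hyp):
    """(substitutions + deletions + insertions) / number of reference words."""
    return sum(align_counts(ref, hyp)) / len(ref)


def sentence_error_rate(pairs):
    """Fraction of (reference, hypothesis) pairs that differ in any word."""
    return sum(1 for ref, hyp in pairs if ref != hyp) / len(pairs)


ref = "roger i got it".split()
hyp = "roger i got a".split()
print(align_counts(ref, hyp), word_error_rate(ref, hyp))
```

A single wrong word makes the whole sentence count as an error, which is why sentence error rates run far higher than word error rates on the same output.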

For speech understanding, we also report on the same
three conditions. In this case, it is more difficult to measure
performance. We decided to adopt the evaluation
methodology proposed by White and O'Connell (i.e., fluency and
adequacy criteria) [12]. The fluency and adequacy of the French
and Korean translations were evaluated by native speakers
of those languages. Text understanding was evaluated in the
same way except that, in this case, we had only one system.

Table 1 shows the speech recognition results as a function
of language model. Note that the sentence error rates are
approximately 50% for each of the recognizers. These error
rates are higher than expected. We would expect lower
error rates if we had used task-specific acoustic models and/or
had more training data. As expected, the sentence error
rate for the data-driven trigram recognizer is slightly lower
than the sentence error rate for the data-driven bigram
recognizer. However, the sentence error rate for the TINA-LM
recognizer is higher than that of either of the data-driven
n-gram recognizers. TINA-LM gives a very high deletion
error rate which is due in large part to the near 100% deletion
incurred for failed sentences. We show later in this section
that, despite higher speech recognition sentence error rates,
the TINA-LM system produces “better” translations than
do either of the other speech-to-speech translation systems.

The text and speech understanding results are shown in
Table 2. The second column of Table 2 indicates the number
of test sentences that each system parses (i.e., the number of
test sentences for which the system in question produces a
parse tree, semantic frame, and paraphrases). The remaining
columns of the table show the fluency and adequacy scores
of the French and Korean translations, where 1 is the lowest
score and 5 is the highest score. The first point to note is
that the text translation system parses 52.1% of the 190 test
sentences. This is a particularly good result, considering that
TINA only parses 57.9% (288/497) of the training sentences
taken from the military exercise transcription. The
conclusion is that we have covered part of the coalition brigade
domain quite well. The second point to note is that the
text translation system outperforms the two data-driven
n-gram systems, both in terms of number of sentences parsed
and number of fluent and adequate parses. This result is,
of course, expected since the data-driven n-gram
recognizers have high error rates. Another point to note is that
the data-driven trigram system does slightly better than the
data-driven bigram system. This is also an expected result.
Table 2 also shows that the TINA-LM system definitely
outperforms the two data-driven n-gram systems. (Note the
number of fluent and, in particular, adequate parses for the
three systems in question.) In addition, the TINA-LM
system performs nearly as well as the text translation system.
The TINA-LM French system produces ten fewer adequate
parses than does its text translation counterpart and the
TINA-LM Korean system produces only one fewer adequate
parse than its text translation counterpart. Furthermore,
the TINA-LM system parses many more sentences (146 to
99) than does the text translation system. We will discuss
this result as well as the general performance and merits of
the TINA-LM system in the next section.
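
The parse, fluent-parse, and adequate-parse counts discussed above can be tallied as in the following sketch. It assumes that a fluent (or adequate) parse is a parsed sentence scored at least three on the 1-to-5 scale, which is our reading of the paper's definition; the record layout, the `count_parses` helper, and the toy scores are invented for illustration and are not the experiment's data.

```python
def count_parses(results, threshold=3):
    """Return (parsed, fluent, adequate) counts.

    results: one entry per test sentence; None when the system produced no
    parse, otherwise a dict with 1-to-5 'fluency' and 'adequacy' scores.
    threshold: minimum score for a parse to count (an assumption, see above).
    """
    parsed = [r for r in results if r is not None]
    fluent = sum(1 for r in parsed if r["fluency"] >= threshold)
    adequate = sum(1 for r in parsed if r["adequacy"] >= threshold)
    return len(parsed), fluent, adequate


# Toy scores, not taken from Table 2.
toy_results = [
    {"fluency": 4, "adequacy": 2},
    None,
    {"fluency": 3, "adequacy": 5},
    {"fluency": 1, "adequacy": 1},
]
print(count_parses(toy_results))

# Check of figures quoted in the text: 99 of 190 sentences parsed by the text system.
text_system_rate = round(100 * 99 / 190, 1)
print(text_system_rate)
```

Counting parses above a threshold, rather than averaging scores, keeps unparsed sentences from being silently ignored when systems with different parse rates are compared.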

There are a number of important caveats to the above
experiments. The first and most important caveat is that

^The TINA-generated bigram was not evaluated because we
are not confident that it is bug-free.

^A fluent parse is a sentence which is parsed by the appropriate
system and whose system translation is given a fluency score of at
least three. An adequate parse is defined analogously.


CCLINC, and therefore any evaluation of it, is still in a
preliminary stage. The second caveat is that, as previously
mentioned, we only ran a 1-best CSR in our decoupled
systems. We would expect the performance of the n-gram
systems to improve with the use of N-best CSRs. Finally,
TINA's parse coverage on both the training and test sets
would improve substantially if we added a robust parsing
capability, although the paraphrase quality would probably
degrade for robust analyses.


4.  Discussion 

In this section, we shall discuss the merits of the
tightly-coupled approach, the portability of CCLINC to new
languages, and the applicability of speech translation
technology to the coalition brigade domain.

We believe that the TINA-LM system has numerous
strengths. First, the system directly incorporates a natural
language model into the primary search process of the
recognizer. NL constraints are applied immediately in a
left-to-right pass through the sentence, thereby coercing the
system to produce only grammatical recognizer outputs.^
Thus, TINA-LM often produces a parseable recognition
output even when the output is not correct (i.e., when there is
at least one word error in the recognition output).
Specifically, the TINA-LM system produces incorrect but parseable
recognition outputs for 62 of the 190 test sentences. In
contrast, the data-driven bigram system produces incorrect but
parseable recognition outputs for only four of the
sentences. It is these numbers which explain how the TINA-LM
system produces “better” translations than do the n-gram
systems despite higher recognition error rates. These
numbers also explain how the TINA-LM system parses more
sentences than does the text translation system. In
particular, the TINA-LM recognizer transforms 50 unparseable
sentences into parseable sentences. In other words, of the 62
test sentences for which the TINA-LM recognizer produces
an incorrect but parseable output, only twelve can be parsed
by the text translation system. The second strength of the
TINA-LM system is that it enforces long-distance language
constraints that n-gram language model-based systems
cannot. For instance, the TINA-LM system correctly recognizes
the sentence “Roger I got it.” In contrast, the data-driven
bigram system produces “Roger I got a” for the same
sentence. The output “Roger I got a” does not satisfy the
following long-distance ordering constraint: “... subject verb
object end-of-sentence.” The third advantage of the TINA-LM
system is that it uses a meaning-based generalization
mechanism rather than the experience-based generalization
mechanism that n-gram language models use. Meaning-based
generalization is particularly important when data are sparse,
as in our current situation.

One advantage of interlingual systems such as CCLINC
is that they are, at least in theory, readily portable to new
languages. In practice, we found this statement to be
reasonably true. The use of a CCL made extension to French
significantly more straightforward since English and French
share numerous characteristics. An example of a feature which
we needed to add to the CCL to extend CCLINC to French
is the ability to distinguish between direct and indirect
objects and direct and indirect object pronouns. In English,
both objects and object pronouns follow the verb whereas

^Theoretically, the TINA-LM recognizer should produce a
grammatical output for each sentence. However, it may produce
no output if there is no sentence hypothesis with the minimum
acoustic/linguistic score. In fact, the TINA-LM system did not
produce parses for 44 of the 190 test sentences. (See Table 2.)

Table 1: Speech Recognition as a Function of Language Model

Table 2: Text and Speech Translation Results


in  French,  direct  and  indirect  objects  follow  the  verb,  but 
direct  and  indirect  object  pronouns  precede  the  verb. 

The use of a CCL made extension to Korean somewhat
easier. We did not need to capture “rank” information in the
CCL because CCLINC assumes one mode of speaking. (One
big difference between English and Korean is that Korean
has different verb endings depending on the ranks of the
speaker and the listener. CCLINC emulates the speech that
an educated military person of middle rank would use to his


.  V  Ur*  t 

Finally, we would like to comment on the applicability of
speech translation technology to the coalition brigade
domain. In other words, we are interested in how easy it is to
automatically translate “military” sentences as compared to
sentences in other domains. On the one hand, as much as
40% of our data involves nothing more than user or grid
identification or other basic Army protocols. On the other hand,
“militarese” is more ungrammatical and colloquial than is
typical speech. Furthermore, it is difficult to find translators
and evaluators with military knowledge, both of which are
needed in the development of CCLINC.


5.  Future  Plans 

Based on our initial results and an assessment of user needs
in Korea, we expect that the focus of our work in the near
future will be on language modeling and understanding of
real message traffic, which will serve as a basis for application
to both text and speech translation.

Forward Area Language Converter
Mr.  Daniel  W.  Smith,  Jr. 


Initial  prototype  system  will  demonstrate  translation  of  2-3  languages. 

Final  System  will  include  language  translation  capabilities  to  support  XVIII  Airborne 
Corps  contingencies. 

System  user-friendly  utilizing  a  Graphical  User  Interface  (GUI). 


Final version of system software will step the soldier through the document scanning
procedure. Once document is scanned, the soldier will essentially "press a key" and
initiate an automatic OCR/translation procedure of the scanned information followed
by transmission over a SINCGARS radio or the MSE digital communications systems.
Custom integration software will take care of all the necessary calls to the program,
file generation, execution, etc.; this procedure will be transparent to the user.

Contact:  Mr.  Daniel  W.  Smith,  Jr. 

Science  Advisor 

CDR XVIII Airborne Corps

ATTN: AFZA-CS-S

Ft. Bragg, NC 28307-5000

(910) 396-3780; FAX: (910) 396-8215

prepared by:

Perri Nejib, Asst. Science
Adviser, XVIII Airborne Corps,
910-396-3780

Multimedia Medical Language Translator

HMC(AW)  Michael  D.  Hesslink 
Captain  Michael  Valdez 


The Multimedia Medical Language Translator (MLT) uses a laptop computer to help
medical examiners communicate with patients. The system enables a health-care provider to
ask a series of standard examination questions, and to convey simple words of greeting and
explanation, in a patient's native tongue. This contact can make all the difference in keeping
the patient calm and in getting the information necessary for prompt, effective treatment.

Developed by Commander Lee Morin of the U.S. Navy Medical Corps, MLT was first
used by U.S. Navy health-care staff of Fleet Hospital Zagreb, while supporting U.N.
peacekeeping forces in the former Yugoslavia. The hospital is responsible for the health care
of 40,000 U.N. personnel from 35 nations.

Distributed as a CD-ROM disk, the program is applicable to any type of health-care
environment. It promises to be especially valuable in crises, such as natural disasters or
political conflicts, or in emergency rooms of metropolitan hospitals, where rapid response is
needed and interpreters may not be readily available.

The current version of MLT can be used by anyone literate in English, Russian, or
Chinese. He or she can point to a series of phrases from a list of nearly 2,000 or select one of
more than 40 "scripts" for various topics and specialties, from dentistry to gynecology. The
device then "speaks" the phrases or script in the voice of a native speaker from one of several
dozen languages. One script cycles through all available languages, asking the patient, "Do
you speak...?" The medical worker can also use the computer's search function to instantly
find desired words or phrases.
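
The phrase list, scripts, and search function described above suggest a simple lookup structure. The sketch below is an illustration only, written in Python rather than the program's Visual Basic; the class name, the sample phrases, and the audio file names are hypothetical and are not taken from MLT itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PhraseBook:
    # English phrase -> {language name -> recorded audio clip (hypothetical file names)}
    phrases: Dict[str, Dict[str, str]] = field(default_factory=dict)
    # Script name -> ordered list of English phrases
    scripts: Dict[str, List[str]] = field(default_factory=dict)

    def search(self, term: str) -> List[str]:
        """Return every phrase containing the search term, ignoring case."""
        term = term.lower()
        return [p for p in self.phrases if term in p.lower()]

    def clip_for(self, phrase: str, language: str) -> str:
        """Look up the native-speaker recording for one phrase."""
        return self.phrases[phrase][language]

    def clips_for_script(self, name: str, language: str) -> List[str]:
        """Look up the recordings for every phrase in a script, in order."""
        return [self.clip_for(p, language) for p in self.scripts[name]]


book = PhraseBook(
    phrases={
        "Hello.": {"Croatian": "hr_hello.wav"},
        "Where is the pain?": {"Croatian": "hr_pain.wav"},
    },
    scripts={"greeting": ["Hello.", "Where is the pain?"]},
)

matches = book.search("pain")
clips = book.clips_for_script("greeting", "Croatian")
```

Storing recordings per phrase and per language means a new language can be added by supplying new audio clips, without changing the phrase list or the scripts.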

Written in state-of-the-art Visual Basic running under Microsoft Windows, the MLT
program  is  compact  and  can  function  on  a  basic  machine  with  4  megabytes  of  RAM  and  a 
single-speed  CD  player.  The  device  can  be  customized  to  each  user. 

Contact:  HMC(AW)  Michael  D.  Hesslink 

Naval  Aerospace  and  Operational  Medical  Institute 
ATTN:  Code  05 
220  Hovey  Rd. 

Pensacola,  FL  32508-1047 

(904)  452-8212;  FAX:  (904)  452-3404 
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e-mail:  Klipple@cs.umd.edu 
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SFC  Thomas  L.  Rosenbarger 
U.S. Army 5th Special Forces Group (A
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Pittsburgh,  PA  15213 

(412)  268-7676;  FAX:  (412)  268-5578 
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